Abstract

In online learning the performance of an algorithm is typically compared to the performance of a fixed function from some class, with a quantity called regret. Forster [12] proposed a last-step min-max algorithm which was somewhat simpler than the algorithm of Vovk [26], yet with the same regret. In fact the algorithm he analyzed assumed that the choices of the adversary are bounded, yielding artificially only the two extreme cases. We fix this problem by weighting the examples in such a way that the min-max problem is well defined, and provide an analysis with logarithmic regret that may have a better multiplicative factor than both bounds of Forster [12] and Vovk [26]. We also derive a new bound that may be sub-logarithmic, as a recent bound of Orabona et al. [21], but may have a better multiplicative factor. Finally, we analyze the algorithm in a weak type of non-stationary setting, and show a bound that is sublinear if the non-stationarity is sublinear as well.

Keywords: Online learning, Regression, Min-max learning 



1. Introduction 

We consider the online learning regression problem, in which a learning algorithm tries to predict real numbers in a sequence of rounds given some side-information or inputs $x_t \in \mathbb{R}^d$. Real-world example applications for these algorithms are weather or stock-market predictions. The goal of the algorithm is to have a small discrepancy between its predictions and the associated outcomes $y_t \in \mathbb{R}$. This discrepancy is measured with a loss function, such as the square loss. It is common to evaluate algorithms by their regret, the difference between the cumulative loss of an algorithm and the cumulative loss of any function taken from some class.

Forster [12] proposed a last-step min-max algorithm for online regression that makes a prediction assuming it is the last example to be observed, where the goal of the algorithm is to minimize the regret with respect to linear functions. The resulting optimization problem was convex in both the choice of the algorithm and the choice of the adversary, yielding an unbounded optimization problem. Forster circumvented this problem by assuming a bound $Y$ over the choices of the adversary that should be known to the algorithm, yet his analysis is for the version with no bound.

We propose a modified last-step min-max algorithm with weights over examples, which are controlled so as to obtain a problem that is concave over the choices of the adversary and convex over the choices of the algorithm. We analyze our algorithm and show a logarithmic regret that may have a better multiplicative factor than the analysis of Forster. We derive an additional analysis that is logarithmic in the loss of the reference function, rather than in the number of rounds $T$. This behaviour was recently shown by Orabona et al. [21] for a certain online gradient descent algorithm. Yet, their bound [21] has a similar multiplicative factor to that of Forster [12], while our bound has a potentially better multiplicative factor and the same dependency on the cumulative loss of the reference function as Orabona et al. [21]. Additionally, our algorithm and analysis are totally free of assuming the bound $Y$ or knowing its value.

Competing with the best single function might not suffice for some problems. In many real-world applications, the true target function is not fixed, but may change from time to time. We also bound the performance of our algorithm in a non-stationary environment, where we measure the complexity of the non-stationary environment by the total deviation of a collection of linear functions from some fixed reference point. We show that our algorithm maintains an average loss close to that of the best sequence of functions, as long as the total of this deviation is sublinear in the number of rounds $T$.

A short version appeared in The 23rd International Conference on Algorithmic Learning Theory (ALT 2012). This journal version of the paper additionally includes: (1) a recursive form of the algorithm and a comparison to other algorithms of the same form (Sec. 3.1); (2) a kernel version of the algorithm (Sec. 3.2); (3) a MAP interpretation of the minimization problems (Remark 1 and Remark 2); (4) all proofs and an extended related-work section.



2. Problem Setting 

We work in the online setting for regression evaluated with the squared loss. Online algorithms work in rounds or iterations. On each iteration an online algorithm receives an instance $x_t \in \mathbb{R}^d$ and predicts a real value $\hat{y}_t \in \mathbb{R}$. It then receives a label $y_t \in \mathbb{R}$, possibly chosen by an adversary, suffers loss $\ell_t(\mathrm{alg}) = \ell(y_t, \hat{y}_t) = (y_t - \hat{y}_t)^2$, updates its prediction rule, and proceeds to the next round. The cumulative loss suffered by the algorithm over $T$ iterations is,

$$ L_T(\mathrm{alg}) = \sum_{t=1}^{T} \ell_t(\mathrm{alg}) . \tag{1} $$

The goal of the algorithm is to perform well compared to any predictor from some function class. 

A common choice is to compare the performance of an algorithm with respect to a single function, or specifically a single linear function, $f(x) = x^\top u$, parameterized by a vector $u \in \mathbb{R}^d$. Denote by $\ell_t(u) = (x_t^\top u - y_t)^2$ the instantaneous loss of a vector $u$, and by $L_T(u) = \sum_t \ell_t(u)$. The regret with respect to $u$ is defined to be,

$$ R_T(u) = \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - L_T(u) . $$

A desired goal of the algorithm is to have $R_T(u) = o(T)$, that is, the average loss suffered by the algorithm will converge to the average loss of the best linear function $u$.
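
As a concrete reading of these definitions, the following minimal Python sketch computes the cumulative square loss and the regret with respect to a fixed linear comparator. The function names are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import numpy as np

def cumulative_square_loss(predictions, labels):
    """Cumulative square loss: sum_t (y_t - yhat_t)^2."""
    predictions = np.asarray(predictions, dtype=float)
    labels = np.asarray(labels, dtype=float)
    return float(np.sum((labels - predictions) ** 2))

def regret(predictions, X, labels, u):
    """R_T(u): loss of the algorithm minus loss of the linear function x -> x^T u."""
    X = np.asarray(X, dtype=float)
    comparator_predictions = X @ np.asarray(u, dtype=float)
    return (cumulative_square_loss(predictions, labels)
            - cumulative_square_loss(comparator_predictions, labels))
```

For example, an algorithm that always predicts zero on labels generated exactly by $u$ has regret equal to its own cumulative loss, since the comparator suffers no loss.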

Below in Sec. 5 we will also consider an extension of this form of regret, and evaluate the performance of an algorithm against some $T$-tuple of functions, $(u_1, \dots, u_T) \in \mathbb{R}^d \times \dots \times \mathbb{R}^d$,

$$ R_T(u_1, \dots, u_T) = \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - L_T(u_1, \dots, u_T) , $$

where $L_T(u_1, \dots, u_T) = \sum_t \ell_t(u_t)$. Clearly, with no restriction on the $T$-tuple, any algorithm may suffer a regret linear in $T$, as one can set $u_t = x_t \left( y_t / \|x_t\|^2 \right)$, and suffer zero quadratic loss in all rounds. Thus, we restrict below the possible choices of the $T$-tuple either explicitly, or implicitly via some penalty.

3. A Last Step Min-Max Algorithm 

Our algorithm is derived based on a last-step min-max prediction, proposed by Forster [12] and Takimoto and Warmuth [24]. See also the work of Azoury and Warmuth [1]. An algorithm following this approach outputs the min-max prediction assuming the current iteration is the last one. The algorithm we describe below is based on an extension of this notion. For this purpose we introduce a weighted cumulative loss using positive input-dependent weights $\{a_t\}_{t=1}^{T}$,

$$ L_T^a(u) = \sum_{t=1}^{T} a_t \left( y_t - u^\top x_t \right)^2 , \qquad L_T^a(u_1, \dots, u_T) = \sum_{t=1}^{T} a_t \left( y_t - u_t^\top x_t \right)^2 . $$

The exact values of the weights $a_t$ will be defined below.
Our variant of the last step min-max algorithm predicts^1

$$ \hat{y}_T = \arg\min_{\hat{y}_T} \max_{y_T} \left[ \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - \inf_{u \in \mathbb{R}^d} \left( b \|u\|^2 + L_T^a(u) \right) \right] , \tag{2} $$

for some positive constant $b > 0$. We next compute the actual prediction based on the optimal last step min-max solution. We start with additional notation,

$$ A_t = b I + \sum_{s=1}^{t} a_s x_s x_s^\top \in \mathbb{R}^{d \times d} \tag{3} $$

$$ b_t = \sum_{s=1}^{t} a_s y_s x_s \in \mathbb{R}^{d} . \tag{4} $$

The solution of the internal infimum over $u$ is summarized in the following lemma.



^1 $\hat{y}_T$ and $y_T$ serve both as quantifiers (over the min and max operators, respectively), and as the optimal values of this optimization problem.

Figure 1: An illustration of the minmax objective function $G(y_T, \hat{y}_T)$ of (7). The black line is the value of the objective as a function of $y_T$ for the optimal predictor $\hat{y}_T$. Left: Forster's optimization function (convex in $y_T$). Center: our optimization function (strictly concave in $y_T$, case 1 in Theorem 2). Right: our optimization function (invariant to $y_T$, case 2 in Theorem 2).

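Lemma 1. For all $t \ge 1$, the function $f(u) = b \|u\|^2 + \sum_{s=1}^{t} a_s \left( y_s - u^\top x_s \right)^2$ is minimal at a unique point $u_t$ given by,

$$ u_t = A_t^{-1} b_t \quad \text{and} \quad f(u_t) = \sum_{s=1}^{t} a_s y_s^2 - b_t^\top A_t^{-1} b_t . \tag{5} $$

Proof: From

$$ f(u) = b \|u\|^2 + \sum_{s=1}^{t} a_s \left( y_s - u^\top x_s \right)^2 = \sum_{s=1}^{t} a_s y_s^2 - 2 u^\top \left( \sum_{s=1}^{t} a_s y_s x_s \right) + u^\top \left( b I + \sum_{s=1}^{t} a_s x_s x_s^\top \right) u = \sum_{s=1}^{t} a_s y_s^2 - 2 u^\top b_t + u^\top A_t u , $$

where the last equality uses (3) and (4), it follows that $\nabla f(u) = 2 A_t u - 2 b_t$ and $\nabla^2 f(u) = 2 A_t$. Since $A_t$ is positive definite, $f$ is strictly convex and it is minimal if $\nabla f(u) = 0$, i.e. for $u = A_t^{-1} b_t$. This shows that $u_t = A_t^{-1} b_t$ and we obtain

$$ f(u_t) = f\left( A_t^{-1} b_t \right) = \sum_{s=1}^{t} a_s y_s^2 - 2 b_t^\top A_t^{-1} b_t + b_t^\top A_t^{-1} A_t A_t^{-1} b_t = \sum_{s=1}^{t} a_s y_s^2 - b_t^\top A_t^{-1} b_t . $$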
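
The closed form of Lemma 1 can be checked numerically. The sketch below is illustrative only (the function names are hypothetical, and NumPy is an assumed dependency): it builds $A_t$ and $b_t$ from (3) and (4), solves for $u_t$, and evaluates the objective directly so the two values can be compared.

```python
import numpy as np

def weighted_ridge_minimizer(X, y, a, b):
    """Return (u_t, f(u_t)) for f(u) = b||u||^2 + sum_s a_s (y_s - u^T x_s)^2."""
    X, y, a = (np.asarray(v, dtype=float) for v in (X, y, a))
    d = X.shape[1]
    A = b * np.eye(d) + (X * a[:, None]).T @ X      # A_t = bI + sum_s a_s x_s x_s^T, eq. (3)
    b_vec = X.T @ (a * y)                           # b_t = sum_s a_s y_s x_s, eq. (4)
    u = np.linalg.solve(A, b_vec)                   # u_t = A_t^{-1} b_t
    f_min = float(np.sum(a * y ** 2) - b_vec @ u)   # closed-form minimum, eq. (5)
    return u, f_min

def weighted_ridge_objective(u, X, y, a, b):
    """Direct evaluation of f(u), used only to check the closed form."""
    X, y, a, u = (np.asarray(v, dtype=float) for v in (X, y, a, u))
    residuals = y - X @ u
    return float(b * (u @ u) + np.sum(a * residuals ** 2))
```

Evaluating `weighted_ridge_objective` at the returned `u` should reproduce `f_min` up to floating-point error, and any perturbation of `u` should give a strictly larger value.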



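Remark 1. The minimization problem in Lemma 1 can be interpreted as a MAP estimator of $u$ based on the sequence $\{(x_s, y_s)\}_{s=1}^{t}$ in the following generative model:

$$ u \sim \mathcal{N}\left( 0, \sigma_b^2 I \right) , \qquad y_s \sim \mathcal{N}\left( x_s^\top u, \sigma_s^2 \right) , $$

where $\sigma_b^2 = \frac{1}{2b}$ and $\sigma_s^2 = \frac{1}{2 a_s}$.

Under the model we calculate,

$$ u_{MAP} = \arg\max_u P\left( u \mid \{x_s\}, \{y_s\} \right) = \arg\max_u \left[ P(u) \prod_{s=1}^{t} P(y_s \mid u, x_s) \right] = \arg\min_u \left[ -\log P(u) - \sum_{s=1}^{t} \log P(y_s \mid u, x_s) \right] . \tag{6} $$

By our Gaussian generative model,

$$ -\log P(u) = \log \left( 2 \pi \sigma_b^2 \right)^{d/2} + \frac{1}{2 \sigma_b^2} \|u\|^2 , \qquad -\log P(y_s \mid u, x_s) = \log \left( 2 \pi \sigma_s^2 \right)^{1/2} + \frac{1}{2 \sigma_s^2} \left( y_s - x_s^\top u \right)^2 . $$

Substituting in (6) we get

$$ u_{MAP} = \arg\min_u \left[ \frac{1}{2 \sigma_b^2} \|u\|^2 + \sum_{s=1}^{t} \frac{1}{2 \sigma_s^2} \left( y_s - x_s^\top u \right)^2 \right] , $$

and by using $\frac{1}{2 \sigma_b^2} = b$ and $\frac{1}{2 \sigma_s^2} = a_s$ we get the minimization problem of Lemma 1.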

Substituting (5) back in (2) we obtain the following form of the minmax problem,

$$ \min_{\hat{y}_T} \max_{y_T} G(y_T, \hat{y}_T) \quad \text{for} \quad G(y_T, \hat{y}_T) = \alpha(a_T) \, y_T^2 + 2 \beta(a_T, \hat{y}_T) \, y_T + \hat{y}_T^2 \tag{7} $$

for some functions $\alpha(a_T)$ and $\beta(a_T, \hat{y}_T)$. Clearly, for this problem to be well defined the function $G$ should be convex in $\hat{y}_T$ and concave in $y_T$.

A previous choice, proposed by Forster [12], is to have uniform weights and set $a_t = 1$ (for $t = 1, \dots, T$), which for the particular function $\alpha(a_T)$ yields $\alpha(a_T) > 0$. Thus, $G(y_T, \hat{y}_T)$ is a convex function in $y_T$, implying that the optimal value of $G$ is not bounded from above. Forster [12] addressed this problem by restricting $y_T$ to belong to a predefined interval $[-Y, Y]$, known also to the learner. As a consequence, the adversary's optimal choice is in fact either $y_T = Y$ or $y_T = -Y$, which in turn yields an optimal predictor which is clipped at this bound, $\hat{y}_T = \mathrm{clip}\left( b_{T-1}^\top A_T^{-1} x_T, Y \right)$, where for $y > 0$ we define $\mathrm{clip}(x, y) = x$ if $|x| \le y$ and $\mathrm{clip}(x, y) = y \, \mathrm{sign}(x)$, otherwise.

This phenomenon is illustrated in the left panel of Fig. 1 (best viewed in color). For the minmax optimization function defined by Forster [12], fixing some value of $\hat{y}_T$, the function is convex in $y_T$, and the adversary would achieve a maximal value at the boundary of the feasible interval of $y_T$. That is, either $y_T = Y$ or $y_T = -Y$, as indicated by the two magenta lines at $y_T = \pm 10$. The optimal predictor $\hat{y}_T$ is achieved somewhere along the lines $y_T = Y$ or $y_T = -Y$.

We propose an alternative approach to make the minmax optimal solution bounded by appropriately setting the weight $a_T$ such that $G(y_T, \hat{y}_T)$ is concave in $y_T$ for a constant $\hat{y}_T$. We explicitly consider two cases. First, set $a_T$ such that $G(y_T, \hat{y}_T)$ is strictly concave in $y_T$, and thus attains a single maximum with no need to artificially restrict the value of $y_T$. In this case the function has a unique maximum point in $y_T$, which is the worst-case choice of the adversary. The optimal predictor $\hat{y}_T$ is achieved at the unique saddle point, as illustrated in the center panel of Fig. 1. A second case is to set $a_T$ such that $\alpha(a_T) = 0$, so that the minmax function $G(y_T, \hat{y}_T)$ becomes linear in $y_T$. Here, the optimal prediction is achieved by choosing $\hat{y}_T$ such that $\beta(a_T, \hat{y}_T) = 0$, which makes $G(y_T, \hat{y}_T)$ invariant to $y_T$, as illustrated in the right panel of Fig. 1.

Equipped with Lemma 1 we develop the optimal solution of the min-max predictor, summarized in the following theorem.

Theorem 2. Assume that $1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T \le 0$. Then the optimal prediction for the last round $T$ is

$$ \hat{y}_T = b_{T-1}^\top A_{T-1}^{-1} x_T . \tag{8} $$

The proof of the theorem makes use of the following technical lemma.

Lemma 3. For all $t = 1, 2, \dots, T$

$$ a_t^2 x_t^\top A_t^{-1} x_t + 1 - a_t = \frac{1 + a_t x_t^\top A_{t-1}^{-1} x_t - a_t}{1 + a_t x_t^\top A_{t-1}^{-1} x_t} . \tag{9} $$

The proof appears in Appendix A. We now prove Theorem 2.



Proof: The adversary can choose any $y_T$, thus the algorithm should predict $\hat{y}_T$ such that the following quantity, obtained from (2) using (5), is minimal,

$$ \max_{y_T} \left( \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - \sum_{t=1}^{T} a_t y_t^2 + b_T^\top A_T^{-1} b_T \right) . $$

That is, we need to solve the following minmax problem

$$ \min_{\hat{y}_T} \max_{y_T} \left( \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - \sum_{t=1}^{T} a_t y_t^2 + b_T^\top A_T^{-1} b_T \right) . $$

We use the following relation to re-write the optimization problem,

$$ b_T^\top A_T^{-1} b_T = b_{T-1}^\top A_T^{-1} b_{T-1} + 2 a_T y_T b_{T-1}^\top A_T^{-1} x_T + a_T^2 y_T^2 x_T^\top A_T^{-1} x_T . \tag{10} $$

Omitting all terms that do not depend on $y_T$ and $\hat{y}_T$,

$$ \min_{\hat{y}_T} \max_{y_T} \left( (y_T - \hat{y}_T)^2 - a_T y_T^2 + 2 a_T y_T b_{T-1}^\top A_T^{-1} x_T + a_T^2 y_T^2 x_T^\top A_T^{-1} x_T \right) . $$

We manipulate the last problem to be of the form (7) using Lemma 3,

$$ \min_{\hat{y}_T} \max_{y_T} \left( \frac{1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T}{1 + a_T x_T^\top A_{T-1}^{-1} x_T} \, y_T^2 + 2 y_T \left( a_T b_{T-1}^\top A_T^{-1} x_T - \hat{y}_T \right) + \hat{y}_T^2 \right) , \tag{11} $$

where

$$ \alpha(a_T) = \frac{1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T}{1 + a_T x_T^\top A_{T-1}^{-1} x_T} \quad \text{and} \quad \beta(a_T, \hat{y}_T) = a_T b_{T-1}^\top A_T^{-1} x_T - \hat{y}_T . $$

We consider two cases: (1) $1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T < 0$ (corresponding to the middle panel of Fig. 1), and (2) $1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T = 0$ (corresponding to the right panel of Fig. 1), starting with the first case,

$$ 1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T < 0 . \tag{12} $$

Denote the inner-maximization problem by,

$$ f(y_T) = \frac{1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T}{1 + a_T x_T^\top A_{T-1}^{-1} x_T} \, y_T^2 + 2 y_T \left( a_T b_{T-1}^\top A_T^{-1} x_T - \hat{y}_T \right) + \hat{y}_T^2 . $$

This function is strictly concave with respect to $y_T$ because of (12). Thus, it has a unique maximal value given by,

$$ f^{\max}(\hat{y}_T) = \frac{-a_T}{1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T} \, \hat{y}_T^2 + \frac{2 a_T b_{T-1}^\top A_T^{-1} x_T \left( 1 + a_T x_T^\top A_{T-1}^{-1} x_T \right)}{1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T} \, \hat{y}_T - \frac{\left( a_T b_{T-1}^\top A_T^{-1} x_T \right)^2 \left( 1 + a_T x_T^\top A_{T-1}^{-1} x_T \right)}{1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T} . $$

Next, we solve $\min_{\hat{y}_T} f^{\max}(\hat{y}_T)$, which is strictly convex with respect to $\hat{y}_T$ because of (12). Solving this problem we get the optimal last step minmax predictor,

$$ \hat{y}_T = b_{T-1}^\top A_T^{-1} x_T \left( 1 + a_T x_T^\top A_{T-1}^{-1} x_T \right) . \tag{13} $$

We further develop the last equation. From (3) we have,

$$ A_T^{-1} a_T x_T x_T^\top A_{T-1}^{-1} = A_T^{-1} \left( A_T - A_{T-1} \right) A_{T-1}^{-1} = A_{T-1}^{-1} - A_T^{-1} . \tag{14} $$

Substituting (14) in (13) we have the following equality as desired,

$$ \hat{y}_T = b_{T-1}^\top A_T^{-1} x_T + b_{T-1}^\top A_T^{-1} a_T x_T x_T^\top A_{T-1}^{-1} x_T = b_{T-1}^\top A_{T-1}^{-1} x_T . \tag{15} $$

We now move to the second case, for which $1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T = 0$, which is written equivalently as,

$$ a_T = \frac{1}{1 - x_T^\top A_{T-1}^{-1} x_T} . \tag{16} $$

Substituting (16) in (11) we get

$$ \min_{\hat{y}_T} \max_{y_T} \left( 2 y_T \left( a_T b_{T-1}^\top A_T^{-1} x_T - \hat{y}_T \right) + \hat{y}_T^2 \right) . $$

For $\hat{y}_T \ne a_T b_{T-1}^\top A_T^{-1} x_T$, the value of the optimization problem is not bounded, as the adversary may choose $y_T = z^2 \left( a_T b_{T-1}^\top A_T^{-1} x_T - \hat{y}_T \right)$ for $z \to \infty$. Thus, the optimal last step minmax prediction is to set $\hat{y}_T = a_T b_{T-1}^\top A_T^{-1} x_T$. Substituting $a_T = 1 + a_T x_T^\top A_{T-1}^{-1} x_T$ and following the derivation from (13) to (15) above yields the desired identity. ■
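
Two algebraic facts carry the proof: the identity of Lemma 3, and the step from (13) to (15) showing that the weight $a_T$ cancels out of the prediction. The following sketch checks both numerically on arbitrary inputs. The function names are hypothetical, NumPy is an assumed dependency, and the check is a sanity test rather than a substitute for the proof in Appendix A.

```python
import numpy as np

def lemma3_sides(A_prev, x, a):
    """Return both sides of (9) for a symmetric positive definite A_{t-1}, input x, weight a."""
    A = A_prev + a * np.outer(x, x)                 # A_t = A_{t-1} + a_t x_t x_t^T
    q = float(x @ np.linalg.solve(A_prev, x))       # x_t^T A_{t-1}^{-1} x_t
    lhs = a ** 2 * float(x @ np.linalg.solve(A, x)) + 1.0 - a
    rhs = (1.0 + a * q - a) / (1.0 + a * q)
    return lhs, rhs

def last_step_predictions(A_prev, b_prev, x, a):
    """Return the prediction written as in (13) and as in (8)/(15)."""
    A = A_prev + a * np.outer(x, x)
    q = float(x @ np.linalg.solve(A_prev, x))
    form_13 = float(b_prev @ np.linalg.solve(A, x)) * (1.0 + a * q)
    form_8 = float(b_prev @ np.linalg.solve(A_prev, x))
    return form_13, form_8
```

Both pairs of values should agree up to rounding for any positive weight, which illustrates why the final predictor (8) does not depend on $a_T$.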



We conclude by noting that although we did not restrict the form of the predictor $\hat{y}_T$, it turns out that it is a linear predictor defined by $\hat{y}_T = x_T^\top w_{T-1}$ for $w_{T-1} = A_{T-1}^{-1} b_{T-1}$. In other words, the functional form of the optimal predictor is the same as the form of the comparison function class, linear functions in our case. We call the algorithm (defined using (3), (4) and (8)) WEMM, for weighted min-max prediction. We note that WEMM can also be seen as an incremental off-line algorithm [1], or follow-the-leader, on a weighted sequence. The prediction $\hat{y}_T = x_T^\top w_{T-1}$ is made with a model that is optimal over a prefix of length $T - 1$. The prediction of the optimal predictor defined in (5) is $x_T^\top u_{T-1} = x_T^\top A_{T-1}^{-1} b_{T-1} = \hat{y}_T$, where $\hat{y}_T$ was defined in (8).

3.1. Recursive form 

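Although Theorem 2 is correct for $1 + a_T x_T^\top A_{T-1}^{-1} x_T - a_T \le 0$, in the rest of the paper we will (almost always) assume an equality, that is

$$ a_t = \frac{1}{1 - x_t^\top A_{t-1}^{-1} x_t} , \qquad t = 1 \dots T . \tag{17} $$

For this case, the WEMM algorithm can be expressed in a recursive form in terms of a weight vector $w_t$ and a covariance-like matrix $\Sigma_t$. We denote $w_t = A_t^{-1} b_t$ and $\Sigma_t = A_t^{-1}$, and develop recursive update rules for $w_t$ and $\Sigma_t$. Using the Sherman-Morrison formula for $A_t^{-1} = \left( A_{t-1} + a_t x_t x_t^\top \right)^{-1}$,

$$ w_t = A_t^{-1} b_t = \left( A_{t-1}^{-1} - \frac{A_{t-1}^{-1} x_t x_t^\top A_{t-1}^{-1}}{a_t^{-1} + x_t^\top A_{t-1}^{-1} x_t} \right) \left( b_{t-1} + a_t y_t x_t \right) $$

$$ = w_{t-1} - \frac{A_{t-1}^{-1} x_t x_t^\top w_{t-1}}{a_t^{-1} + x_t^\top A_{t-1}^{-1} x_t} + a_t y_t A_{t-1}^{-1} x_t - \frac{a_t y_t A_{t-1}^{-1} x_t x_t^\top A_{t-1}^{-1} x_t}{a_t^{-1} + x_t^\top A_{t-1}^{-1} x_t} $$

$$ = w_{t-1} + \frac{y_t A_{t-1}^{-1} x_t - A_{t-1}^{-1} x_t x_t^\top w_{t-1}}{a_t^{-1} + x_t^\top A_{t-1}^{-1} x_t} $$

$$ = w_{t-1} + \left( y_t - x_t^\top w_{t-1} \right) A_{t-1}^{-1} x_t = w_{t-1} + \left( y_t - x_t^\top w_{t-1} \right) \Sigma_{t-1} x_t , \tag{18} $$

where the last line uses $a_t^{-1} + x_t^\top A_{t-1}^{-1} x_t = 1$, which follows from (17), and

$$ \Sigma_t^{-1} = A_t = A_{t-1} + a_t x_t x_t^\top = A_{t-1} + \frac{1}{1 - x_t^\top A_{t-1}^{-1} x_t} x_t x_t^\top = \Sigma_{t-1}^{-1} + \frac{1}{1 - x_t^\top \Sigma_{t-1} x_t} x_t x_t^\top , $$

or

$$ \Sigma_t = \Sigma_{t-1} - \Sigma_{t-1} x_t x_t^\top \Sigma_{t-1} . \tag{19} $$

A summary of the algorithm in a recursive form appears in the right column of Table 1.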
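
The recursive form translates almost line by line into code. Below is a minimal sketch of WEMM using the updates (18) and (19); the class name and interface are hypothetical, NumPy is an assumed dependency, and the sketch relies on the conditions used later in the analysis ($\|x_t\| \le 1$ and $b > 1$), which keep $1 - x_t^\top \Sigma_{t-1} x_t$ positive.

```python
import numpy as np

class WEMM:
    """Sketch of the recursive WEMM updates (18)-(19); assumes ||x_t|| <= 1 and b > 1."""

    def __init__(self, dim, b=2.0):
        if b <= 1.0:
            raise ValueError("this sketch assumes b > 1")
        self.w = np.zeros(dim)            # w_0 = 0
        self.Sigma = np.eye(dim) / b      # Sigma_0 = b^{-1} I

    def predict(self, x):
        # Prediction x_t^T w_{t-1}, made before the label is revealed.
        return float(x @ self.w)

    def update(self, x, y):
        Sx = self.Sigma @ x                           # Sigma_{t-1} x_t
        self.w = self.w + (y - x @ self.w) * Sx       # eq. (18): no normalization term
        self.Sigma = self.Sigma - np.outer(Sx, Sx)    # eq. (19): Sigma is symmetric
```

A typical round calls `predict(x_t)`, observes `y_t`, and then calls `update(x_t, y_t)`. The state should coincide with the batch quantities $w_t = A_t^{-1} b_t$ and $\Sigma_t = A_t^{-1}$ computed with the weights (17).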

It is instructive to compare similar second order online algorithms for regression. Ridge-regression [13], summarized in the third column of Table 1, uses the previous examples to generate a weight-vector, which is used to predict the current example. On round $t$ it sets a weight-vector to be the solution of the following optimization problem,

$$ w_{t-1} = \arg\min_{w} \left[ \sum_{i=1}^{t-1} \left( y_i - x_i^\top w \right)^2 + b \|w\|^2 \right] , $$

and outputs a prediction $\hat{y}_t = x_t^\top w_{t-1}$. The recursive least squares (RLS) algorithm [15] is similar, yet it uses a forgetting factor $0 < r \le 1$, and sets the weight-vector according to

$$ w_{t-1} = \arg\min_{w} \left[ \sum_{i=1}^{t-1} r^{t-i-1} \left( y_i - x_i^\top w \right)^2 \right] . $$

The Aggregating Algorithm for regression (AAR) [27], summarized in the second column of Table 1, was introduced by Vovk and it is similar to ridge-regression, except it contains additional regularization, which eventually makes it shrink the predictions. It is an application of the Aggregating Algorithm [26] (a general algorithm for merging prediction strategies) to the problem of linear regression with square loss. On round $t$, the weight-vector is obtained according to

$$ w_t = \arg\min_{w} \left[ \sum_{i=1}^{t-1} \left( y_i - x_i^\top w \right)^2 + \left( x_t^\top w \right)^2 + b \|w\|^2 \right] , $$

and the algorithm predicts $\hat{y}_t = x_t^\top w_t$. Compared to ridge-regression, the AAR algorithm uses an additional input pair $(x_t, 0)$. The AAR algorithm was shown to be last-step min-max optimal by Forster [12], that is, the predictions can be obtained by solving (2) for $a_t = 1$, $t = 1, \dots, T$.

The AROWR algorithm, summarized in the left column of Table 1, is a modification of the AROW algorithm [5] for regression. It maintains a Gaussian distribution parameterized by a mean $w_t \in \mathbb{R}^d$ and a full covariance matrix $\Sigma_t \in \mathbb{R}^{d \times d}$. Intuitively, the mean $w_t$ represents a current linear function, while the covariance matrix $\Sigma_t$ captures the uncertainty in the linear function $w_t$. Given a new example $(x_t, y_t)$ the algorithm uses its current mean to make a prediction $\hat{y}_t = x_t^\top w_{t-1}$. AROWR then sets the new distribution to be the solution of the following optimization problem,

$$ \arg\min_{w, \Sigma} \left[ D_{KL}\left( \mathcal{N}(w, \Sigma) \, \| \, \mathcal{N}(w_{t-1}, \Sigma_{t-1}) \right) + \frac{1}{2r} \left( y_t - w^\top x_t \right)^2 + \frac{1}{2r} \left( x_t^\top \Sigma x_t \right) \right] . $$

Crammer et al. [9] derived regret bounds for this algorithm.

Comparing WEMM to other algorithms we note two differences. First, for the weight-vector update rule, we do not have the normalization term $1 + x_t^\top \Sigma_{t-1} x_t$. Second, for the covariance matrix update rule, our algorithm gives a non-constant scale to the increment by $x_t x_t^\top$. This scale, $1 / \left( 1 - x_t^\top \Sigma_{t-1} x_t \right)$, is small when the current instance $x_t$ lies along the directions spanned by previously observed inputs $\{x_i\}_{i=1}^{t-1}$, and large when the current instance $x_t$ lies along previously unobserved directions.







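Step | AROWR [9], [25] | AAR [27] / Min-Max [12] | Ridge-Regression [13] | WEMM (this work)
Parameters | $0 < r, b$ | $0 < b$ | $0 < b$ | $1 < b$
Initialize | $w_0 = 0$, $\Sigma_0 = b^{-1} I$ | same | same | same
For $t = 1 \dots T$: receive an instance $x_t$ | (all algorithms)
Output prediction | $\hat{y}_t = x_t^\top w_{t-1}$ | $\hat{y}_t = \frac{x_t^\top w_{t-1}}{1 + x_t^\top \Sigma_{t-1} x_t}$ | $\hat{y}_t = x_t^\top w_{t-1}$ | $\hat{y}_t = x_t^\top w_{t-1}$
Receive a correct label $y_t$ | (all algorithms)
Update $\Sigma_t$ | $\Sigma_t^{-1} = \Sigma_{t-1}^{-1} + \frac{1}{r} x_t x_t^\top$ | $\Sigma_t^{-1} = \Sigma_{t-1}^{-1} + x_t x_t^\top$ | $\Sigma_t^{-1} = \Sigma_{t-1}^{-1} + x_t x_t^\top$ | $\Sigma_t^{-1} = \Sigma_{t-1}^{-1} + \frac{x_t x_t^\top}{1 - x_t^\top \Sigma_{t-1} x_t}$
Update $w_t$ | $w_t = w_{t-1} + \frac{(y_t - x_t^\top w_{t-1}) \Sigma_{t-1} x_t}{r + x_t^\top \Sigma_{t-1} x_t}$ | $w_t = w_{t-1} + \frac{(y_t - x_t^\top w_{t-1}) \Sigma_{t-1} x_t}{1 + x_t^\top \Sigma_{t-1} x_t}$ | $w_t = w_{t-1} + \frac{(y_t - x_t^\top w_{t-1}) \Sigma_{t-1} x_t}{1 + x_t^\top \Sigma_{t-1} x_t}$ | $w_t = w_{t-1} + (y_t - x_t^\top w_{t-1}) \Sigma_{t-1} x_t$
Output | $w_T$, $\Sigma_T$ | (all algorithms)

Table 1: Second order online algorithms for regression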
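
To make the differences between the columns of Table 1 concrete, the sketch below implements one round of each algorithm behind a single hypothetical function. The `algo` strings and the function name are illustrative assumptions; the covariance updates are written in their Sherman-Morrison form so that no matrix inverse is needed.

```python
import numpy as np

def second_order_step(algo, w, Sigma, x, y, r=1.0):
    """One round of an algorithm from Table 1; algo is 'arowr', 'aar', 'ridge' or 'wemm'."""
    Sx = Sigma @ x                       # Sigma_{t-1} x_t
    q = float(x @ Sx)                    # x_t^T Sigma_{t-1} x_t
    y_hat = float(x @ w)
    if algo == "aar":
        y_hat /= 1.0 + q                 # AAR shrinks the prediction
    if algo == "wemm":
        # WEMM: no normalization term (assumes q < 1, e.g. ||x|| <= 1 and b > 1).
        w_new = w + (y - x @ w) * Sx
        Sigma_new = Sigma - np.outer(Sx, Sx)
    else:
        scale = r if algo == "arowr" else 1.0
        w_new = w + (y - x @ w) * Sx / (scale + q)
        Sigma_new = Sigma - np.outer(Sx, Sx) / (scale + q)
    return y_hat, w_new, Sigma_new
```

Inverting the returned `Sigma_new` should recover the corresponding row "Update $\Sigma_t$" of the table, which is a convenient way to check the implementation.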
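3.2. Kernel version of the algorithm

In this section we show that the WEMM algorithm can be expressed in dual variables, which allows an efficient run of the algorithm in any reproducing kernel Hilbert space. We show by induction that the weight-vector $w_t$ and the covariance matrix $\Sigma_t$ computed by the WEMM algorithm in the right column of Table 1 can be written in the form

$$ w_t = \sum_{i=1}^{t} \alpha_i^{(t)} x_i , \qquad \Sigma_t = \sum_{j=1}^{t} \sum_{k=1}^{t} \beta_{j,k}^{(t)} x_j x_k^\top + b^{-1} I , $$

where the coefficients $\alpha_i$ and $\beta_{j,k}$ depend only on inner products of the input vectors.

For the initial step we have $w_0 = 0$ and $\Sigma_0 = b^{-1} I$, which are trivially written in the desired form by setting $\alpha^{(0)} = 0$ and $\beta^{(0)} = 0$. We proceed to the induction step. From the weight-vector update rule (18) we get

$$ w_t = w_{t-1} + \left( y_t - x_t^\top w_{t-1} \right) \Sigma_{t-1} x_t $$

$$ = \sum_{i=1}^{t-1} \alpha_i^{(t-1)} x_i + \left( y_t - \sum_{i=1}^{t-1} \alpha_i^{(t-1)} \left( x_t^\top x_i \right) \right) \left( \sum_{j=1}^{t-1} \sum_{k=1}^{t-1} \beta_{j,k}^{(t-1)} x_j x_k^\top + b^{-1} I \right) x_t $$

$$ = \sum_{i=1}^{t-1} \left[ \alpha_i^{(t-1)} + \left( y_t - \sum_{l=1}^{t-1} \alpha_l^{(t-1)} \left( x_t^\top x_l \right) \right) \sum_{k=1}^{t-1} \beta_{i,k}^{(t-1)} \left( x_k^\top x_t \right) \right] x_i + b^{-1} \left( y_t - \sum_{i=1}^{t-1} \alpha_i^{(t-1)} \left( x_t^\top x_i \right) \right) x_t , $$

thus

$$ \alpha_i^{(t)} = \begin{cases} \alpha_i^{(t-1)} + \left( y_t - \sum_{l=1}^{t-1} \alpha_l^{(t-1)} \left( x_t^\top x_l \right) \right) \sum_{k=1}^{t-1} \beta_{i,k}^{(t-1)} \left( x_k^\top x_t \right) & i = 1, \dots, t-1 \\ b^{-1} \left( y_t - \sum_{l=1}^{t-1} \alpha_l^{(t-1)} \left( x_t^\top x_l \right) \right) & i = t . \end{cases} $$

From the covariance matrix update rule (19) we get

$$ \Sigma_t = \Sigma_{t-1} - \Sigma_{t-1} x_t x_t^\top \Sigma_{t-1} $$

$$ = \sum_{j=1}^{t-1} \sum_{k=1}^{t-1} \beta_{j,k}^{(t-1)} x_j x_k^\top + b^{-1} I - \left( \sum_{j=1}^{t-1} \sum_{k=1}^{t-1} \beta_{j,k}^{(t-1)} \left( x_k^\top x_t \right) x_j + b^{-1} x_t \right) \left( \sum_{l=1}^{t-1} \sum_{m=1}^{t-1} \beta_{l,m}^{(t-1)} \left( x_t^\top x_l \right) x_m^\top + b^{-1} x_t^\top \right) $$

$$ = \sum_{j=1}^{t-1} \sum_{k=1}^{t-1} \left[ \beta_{j,k}^{(t-1)} - \sum_{l=1}^{t-1} \sum_{m=1}^{t-1} \beta_{j,m}^{(t-1)} \beta_{l,k}^{(t-1)} \left( x_m^\top x_t \right) \left( x_t^\top x_l \right) \right] x_j x_k^\top + b^{-1} I $$

$$ \quad - b^{-1} \sum_{j=1}^{t-1} \sum_{m=1}^{t-1} \beta_{j,m}^{(t-1)} \left( x_m^\top x_t \right) x_j x_t^\top - b^{-1} \sum_{k=1}^{t-1} \sum_{l=1}^{t-1} \beta_{l,k}^{(t-1)} \left( x_t^\top x_l \right) x_t x_k^\top - b^{-2} x_t x_t^\top , $$

thus

$$ \beta_{j,k}^{(t)} = \begin{cases} \beta_{j,k}^{(t-1)} - \sum_{l=1}^{t-1} \sum_{m=1}^{t-1} \beta_{j,m}^{(t-1)} \beta_{l,k}^{(t-1)} \left( x_m^\top x_t \right) \left( x_t^\top x_l \right) & j, k = 1, \dots, t-1 \\ -b^{-1} \sum_{l=1}^{t-1} \beta_{l,k}^{(t-1)} \left( x_t^\top x_l \right) & j = t , \; k = 1, \dots, t-1 \\ -b^{-1} \sum_{m=1}^{t-1} \beta_{j,m}^{(t-1)} \left( x_m^\top x_t \right) & k = t , \; j = 1, \dots, t-1 \\ -b^{-2} & j = k = t . \end{cases} $$

A summary of the kernel version of the WEMM algorithm appears in Fig. 2.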
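
The dual updates can be sketched in a few lines once the Gram matrix is available. The function below is an illustrative implementation under stated assumptions: the name `kernel_wemm` and its interface are hypothetical, the kernel is assumed to satisfy $K(x, x) \le 1$ (the analogue of $\|x_t\| \le 1$), and the coefficient matrix $\beta$ is stored densely, so memory grows as $T^2$.

```python
import numpy as np

def kernel_wemm(X, y, kernel, b=2.0):
    """Run kernel WEMM on a sequence; returns predictions and the final dual coefficients."""
    T = len(y)
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])   # Gram matrix
    alpha = np.zeros(T)
    beta = np.zeros((T, T))
    preds = np.zeros(T)
    for t in range(T):
        k = K[:t, t]                          # K(x_i, x_t) for i < t
        preds[t] = alpha[:t] @ k              # yhat_t = sum_i alpha_i K(x_t, x_i)
        err = y[t] - preds[t]
        Bk = beta[:t, :t] @ k                 # sum_k beta_{j,k} K(x_k, x_t); beta is symmetric
        alpha[:t] += err * Bk                 # alpha update, i < t
        alpha[t] = err / b                    # alpha update, i = t
        beta[:t, :t] -= np.outer(Bk, Bk)      # beta update, j, k < t
        beta[t, :t] = -Bk / b                 # beta update, j = t
        beta[:t, t] = -Bk / b                 # beta update, k = t
        beta[t, t] = -1.0 / b ** 2            # beta update, j = k = t
    return preds, alpha, beta
```

With the linear kernel `lambda u, v: float(u @ v)` the predictions should match those of the primal recursion (18)-(19), which gives a direct way to validate the dual form.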

4. Analysis

We analyze the algorithm in two steps. First, in Theorem 4 we show that the algorithm suffers a constant regret compared with the optimal weight vector $u$ evaluated using the weighted loss, $L_T^a(u)$. Second, in Theorem 5 and Theorem 6 we show that the difference between the weighted loss $L_T^a(u)$ and the true loss $L_T(u)$ is only logarithmic in $T$ or in $L_T(u)$.

Theorem 4. Assume $1 + a_t x_t^\top A_{t-1}^{-1} x_t - a_t \le 0$ for $t = 1 \dots T$ (which is satisfied by our choice later). Then, the loss of WEMM, $\hat{y}_t = b_{t-1}^\top A_{t-1}^{-1} x_t$ for $t = 1 \dots T$, is upper bounded by,

$$ L_T(\mathrm{WEMM}) \le \inf_{u \in \mathbb{R}^d} \left( b \|u\|^2 + L_T^a(u) \right) . $$

Furthermore, if $1 + a_t x_t^\top A_{t-1}^{-1} x_t - a_t = 0$, then the last inequality is in fact an equality.
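Parameter: $1 < b$, kernel function $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$
Initialize: Set $\alpha^{(0)} = 0$ and $\beta^{(0)} = 0$
For $t = 1, \dots, T$ do

• Receive an instance $x_t$

• Output prediction $\hat{y}_t = \sum_{i=1}^{t-1} \alpha_i^{(t-1)} K(x_t, x_i)$

• Receive the correct label $y_t$

• Update:

$$ \alpha_i^{(t)} = \begin{cases} \alpha_i^{(t-1)} + \left( y_t - \sum_{l=1}^{t-1} \alpha_l^{(t-1)} K(x_t, x_l) \right) \sum_{k=1}^{t-1} \beta_{i,k}^{(t-1)} K(x_k, x_t) & i = 1, \dots, t-1 \\ b^{-1} \left( y_t - \sum_{l=1}^{t-1} \alpha_l^{(t-1)} K(x_t, x_l) \right) & i = t \end{cases} \tag{20} $$

$$ \beta_{j,k}^{(t)} = \begin{cases} \beta_{j,k}^{(t-1)} - \sum_{l=1}^{t-1} \sum_{m=1}^{t-1} \beta_{j,m}^{(t-1)} \beta_{l,k}^{(t-1)} K(x_m, x_t) K(x_t, x_l) & j, k = 1, \dots, t-1 \\ -b^{-1} \sum_{l=1}^{t-1} \beta_{l,k}^{(t-1)} K(x_t, x_l) & j = t , \; k = 1, \dots, t-1 \\ -b^{-1} \sum_{m=1}^{t-1} \beta_{j,m}^{(t-1)} K(x_m, x_t) & k = t , \; j = 1, \dots, t-1 \\ -b^{-2} & j = k = t \end{cases} \tag{21} $$

Figure 2: Kernel WEMM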



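Proof sketch: Long algebraic manipulation given in Appendix B yields,

$$ \ell_t(\mathrm{WEMM}) + \inf_{u} \left( b \|u\|^2 + L_{t-1}^a(u) \right) - \inf_{u} \left( b \|u\|^2 + L_t^a(u) \right) = \frac{1 + a_t x_t^\top A_{t-1}^{-1} x_t - a_t}{1 + a_t x_t^\top A_{t-1}^{-1} x_t} \left( y_t - \hat{y}_t \right)^2 \le 0 . $$

Summing over $t$ gives the desired bound.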
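
The equality case of Theorem 4 is easy to probe numerically: with the weights (17), the cumulative loss of WEMM should equal $\inf_u \left( b \|u\|^2 + L_T^a(u) \right)$, whose value is given in closed form by (5). The sketch below is a sanity check under the assumptions $\|x_t\| \le 1$ and $b > 1$; the function name is hypothetical.

```python
import numpy as np

def wemm_loss_and_weighted_optimum(X, y, b=2.0):
    """Return (L_T(WEMM), inf_u [b||u||^2 + L_T^a(u)]) using the weights of eq. (17)."""
    d = X.shape[1]
    A = b * np.eye(d)                     # A_0 = bI
    b_vec = np.zeros(d)                   # b_0 = 0
    loss = 0.0
    weighted_sq = 0.0
    for x_t, y_t in zip(X, y):
        A_inv_x = np.linalg.solve(A, x_t)
        y_hat = b_vec @ A_inv_x           # prediction (8)
        a_t = 1.0 / (1.0 - x_t @ A_inv_x) # weight (17)
        loss += (y_t - y_hat) ** 2
        weighted_sq += a_t * y_t ** 2
        A = A + a_t * np.outer(x_t, x_t)
        b_vec = b_vec + a_t * y_t * x_t
    optimum = weighted_sq - b_vec @ np.linalg.solve(A, b_vec)   # eq. (5)
    return float(loss), float(optimum)
```

On any input sequence satisfying the assumptions, the two returned values should coincide up to floating-point error.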



Next we decompose the weighted loss $L_T^a(u)$ into a sum of the actual loss $L_T(u)$ and a logarithmic term. We give two bounds: one is logarithmic in $T$ (Theorem 5), and the second is logarithmic in $L_T(u)$ (Theorem 6). We use the following notation for the loss suffered by $u$ over the worst example,

$$ S = S(u) = \sup_{1 \le t \le T} \ell_t(u) , \tag{22} $$

where $S$ depends on $u$; this dependence is omitted for simplicity. We now turn to state our first result.



Theorem 5. Assume $\|x_t\| \le 1$ for $t = 1 \dots T$ and $b > 1$. Assume further that $a_t = \frac{1}{1 - x_t^\top A_{t-1}^{-1} x_t}$ for $t = 1 \dots T$. Then

$$ L_T^a(u) \le L_T(u) + \frac{b}{b-1} S \ln \left| \frac{1}{b} A_T \right| . $$

The proof follows similar steps to Forster [12]. A detailed proof is given in Appendix C.

Proof sketch: We decompose the weighted loss,

$$ L_T^a(u) = L_T(u) + \sum_{t} (a_t - 1) \ell_t(u) \le L_T(u) + S \sum_{t} (a_t - 1) . \tag{23} $$

From the definition of $a_t$ we have, $a_t - 1 = a_t^2 x_t^\top A_t^{-1} x_t \le \frac{b}{b-1} a_t x_t^\top A_t^{-1} x_t$ (see (C.1)). Finally, following similar steps to Forster [12] we have, $\sum_{t=1}^{T} a_t x_t^\top A_t^{-1} x_t \le \ln \left| \frac{1}{b} A_T \right|$ (see (C.2)).
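
The decomposition (23) and the resulting logarithmic bound can be examined on synthetic data. The following sketch computes the gap $L_T^a(u) - L_T(u) = \sum_t (a_t - 1) \ell_t(u)$ together with the bound $\frac{b}{b-1} S \ln \left| \frac{1}{b} A_T \right|$. The function name is hypothetical and the inputs are assumed to satisfy $\|x_t\| \le 1$ and $b > 1$.

```python
import numpy as np

def weighted_loss_gap_and_bound(X, y, u, b=2.0):
    """Return (L_T^a(u) - L_T(u), (b/(b-1)) * S * ln|A_T / b|) for weights a_t from (17)."""
    d = X.shape[1]
    A = b * np.eye(d)
    gap = 0.0
    S = 0.0
    for x_t, y_t in zip(X, y):
        a_t = 1.0 / (1.0 - x_t @ np.linalg.solve(A, x_t))
        loss_u = (x_t @ u - y_t) ** 2          # instantaneous loss of the comparator
        gap += (a_t - 1.0) * loss_u
        S = max(S, loss_u)                     # worst-example loss, eq. (22)
        A = A + a_t * np.outer(x_t, x_t)
    _, logdet = np.linalg.slogdet(A / b)       # ln |A_T / b|
    bound = b / (b - 1.0) * S * logdet
    return float(gap), float(bound)
```

When the labels are close to linear in the inputs, $S$ is small and the bound is correspondingly tight, independently of the magnitude of the labels themselves.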



Next we show a bound that may be sub-logarithmic if the comparison vector $u$ suffers a sub-linear amount of loss. Such a bound was previously proposed by Orabona et al. [21]. We defer the discussion of the bound until after the proof below.



Theorem 6. Assume $\|x_t\| \le 1$ for $t = 1 \dots T$, and $b > 1$. Assume further that

$$ a_t = \frac{1}{1 - x_t^\top A_{t-1}^{-1} x_t} \tag{24} $$

for $t = 1 \dots T$. Then,

$$ L_T^a(u) \le L_T(u) + \frac{b}{b-1} S d \left[ 1 + \ln \left( 1 + \frac{L_T(u)}{S d} \right) \right] . \tag{25} $$

We prove the theorem with a refined bound on the sum $\sum_t (a_t - 1) \ell_t(u)$ of (23) using the following two lemmas. In Theorem 5 we bound the loss of all examples with $S$ and then bound the remaining term. Here, we instead relate the sum to a subsequence, "pretending" that all of its examples suffer a loss $S$ while keeping the same cumulative loss. This yields an effectively shorter sequence, which we then bound. In the next lemma we show how to find this subsequence, and in the following one we bound the performance.



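Lemma 7. Let $I \subset \{1 \dots T\}$ be the indices of the $T' = \left\lceil \sum_{t=1}^{T} \ell_t(u) / S \right\rceil$ largest elements of $a_t$, that is $|I| = T'$ and $\min_{t \in I} a_t \ge a_\tau$ for all $\tau \in \{1 \dots T\} \setminus I$. Then,

$$ \sum_{t=1}^{T} \ell_t(u) (a_t - 1) \le S \sum_{t \in I} (a_t - 1) . $$

Proof: For a vector $v \in \mathbb{R}^T$ define by $I(v)$ the set of indices of the $T'$ maximal absolute-valued elements of $v$, and define $f(v) = \sum_{t \in I(v)} |v_t|$. The function $f(v)$ is a norm [10] with a dual norm $g(h) = \max \left\{ \|h\|_\infty , \frac{\|h\|_1}{T'} \right\}$. From the property of dual norms we have $v \cdot h \le f(v) g(h)$. Applying this inequality to $v = (a_1 - 1, \dots, a_T - 1)$ and $h = (\ell_1(u), \dots, \ell_T(u))$ we get,

$$ \sum_{t=1}^{T} \ell_t(u) (a_t - 1) \le \max \left\{ S , \frac{\sum_{t=1}^{T} \ell_t(u)}{T'} \right\} \sum_{t \in I} (a_t - 1) . $$

Combining with $S T' = S \left\lceil \sum_{t=1}^{T} \ell_t(u) / S \right\rceil \ge \sum_{t=1}^{T} \ell_t(u)$ completes the proof. ■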
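
The inequality of Lemma 7 depends only on the two sequences $a_t \ge 1$ and $\ell_t(u) \ge 0$, so it can be checked on arbitrary numbers. The sketch below computes both sides; the function name is hypothetical and at least one loss is assumed to be positive so that $S > 0$.

```python
import math
import numpy as np

def lemma7_sides(a, losses):
    """Return (sum_t l_t (a_t - 1), S * sum of the T' largest (a_t - 1))."""
    a = np.asarray(a, dtype=float)
    losses = np.asarray(losses, dtype=float)
    S = losses.max()                              # worst-example loss
    T_prime = math.ceil(losses.sum() / S)         # T' = ceil(sum_t l_t / S)
    top = np.sort(a)[::-1][:T_prime]              # the T' largest weights
    lhs = float(np.sum(losses * (a - 1.0)))
    rhs = float(S * np.sum(top - 1.0))
    return lhs, rhs
```

For instance, with weights `[1.5, 3.0, 2.0]` and losses `[0.1, 0.1, 1.0]` we have $S = 1$ and $T' = 2$, so the right-hand side only uses the two largest weights.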
Note that the quantity $\sum_{t \in I} a_t$ is based only on $T'$ examples, yet was generated using all $T$ examples. In fact, by running the algorithm with only these $T'$ examples the corresponding sum cannot get smaller. Specifically, assume the algorithm is run with inputs $(x_1, y_1), \dots, (x_T, y_T)$ and generated a corresponding sequence $(a_1, \dots, a_T)$. Let $I$ be the set of indices with maximal values of $a_t$ as before. Assume the algorithm is run with the subsequence of examples from $I$ (in the same order) and generated $\tilde{a}_1, \dots, \tilde{a}_T$ (where we set $\tilde{a}_t = 0$ for $t \notin I$). Then, $\tilde{a}_t \ge a_t$ for all $t \in I$. This statement follows from (3), from which we get that the matrix $A_t$ is monotonically increasing in $t$. Thus, by removing examples we get another smaller matrix, which leads to a larger value of $a_t$.

We continue the analysis with a sequence of length $T'$ rather than a subsequence of the original sequence of length $T$ being analyzed. The next lemma upper bounds the sum $\sum_t a_t$ over $T'$ inputs with another sum of the same length, yet using an orthonormal set of vectors of size $d$.

Lemma 8. Let $x_1, \dots, x_\tau$ be any $\tau$ inputs with unit-norm. Assume the algorithm is performing updates using (24) for some $A_0$, resulting in a sequence $a_1, \dots, a_\tau$. Let $E = \{v_1, \dots, v_d\} \subset \mathbb{R}^d$ be an eigen-decomposition of $A_0$ with corresponding eigenvalues $\lambda_1, \dots, \lambda_d$. Then there exists a sequence of indices $j_1, \dots, j_\tau$, where $j_i \in \{1, \dots, d\}$, such that $\sum_t a_t \le \sum_t \tilde{a}_t$, where $\tilde{a}_t$ are generated using (24) on the sequence $v_{j_1}, \dots, v_{j_\tau}$.

Additionally, let $n_s$ be the number of times eigenvector $v_s$ is used ($s = 1, \dots, d$), that is $n_s = |\{ j_t : j_t = s \}|$ (and $\sum_s n_s = \tau$), then,

$$ \sum_{t} \tilde{a}_t \le \tau + \sum_{s=1}^{d} \sum_{r=1}^{n_s} \frac{1}{\lambda_s + r - 2} . $$

Proof: By induction over $\tau$. For $\tau = 1$ we want to upper bound $a_1 = 1 / \left( 1 - x_1^\top A_0^{-1} x_1 \right)$, which is maximized when $x_1 = v_d$, the eigenvector with minimal eigenvalue $\lambda_d$. In this case we have $a_1 = 1 / (1 - 1/\lambda_d) = 1 + 1/(\lambda_d - 1)$, as desired.

Next we assume the lemma holds for some $\tau - 1$ and show it for $\tau$. Let $x_1$ be the first input, and let $\{\gamma_s\}$ and $\{u_s\}$ be the eigenvalues and eigenvectors of $A_1 = A_0 + a_1 x_1 x_1^\top$. The induction assumption implies that $\sum_{t=2}^{\tau} a_t \le (\tau - 1) + \sum_{s=1}^{d} \sum_{r=1}^{n_s} \frac{1}{\gamma_s + r - 2}$. From Theorem 8.1.8 of [14] we know that the eigenvalues of $A_1$ satisfy $\gamma_s = \lambda_s + m_s$ for some $m_s \ge 0$ and $\sum_s m_s = 1$. We thus conclude that

$$ \sum_{t} a_t \le 1 + \frac{1}{\lambda_d - 1} + (\tau - 1) + \sum_{s=1}^{d} \sum_{r=1}^{n_s} \frac{1}{\lambda_s + m_s + r - 2} . $$

The last term is convex in $m_1, \dots, m_d$ and thus is maximized over a vertex of the simplex, that is when $m_k = 1$ for some $k$ and zero otherwise. In this case, the eigenvectors $\{u_s\}$ of $A_1$ are in fact the eigenvectors $\{v_s\}$ of $A_0$, and the proof is completed. ■



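Algorithm | Bound on Regret $R_T(u)$
Vovk [27] | $b \|u\|^2 + d Y^2 \ln \left( 1 + \frac{T}{d b} \right)$
Forster [12] | $b \|u\|^2 + d Y^2 \ln \left( 1 + \frac{T}{d b} \right)$
Crammer et al. [9] | $r b \|u\|^2 + d A \ln \left( 1 + \frac{T}{d r b} \right)$
Orabona et al. [21] | $2 \|u\|^2 + d (U + Y)^2 \ln \left( 1 + \dots \right)$
Theorem 5 | $b \|u\|^2 + \frac{b}{b-1} S d \ln \left( 1 + \frac{T}{d (b-1)} \right)$
Theorem 6 | $b \|u\|^2 + \frac{b}{b-1} S d \left[ 1 + \ln \left( 1 + \frac{L_T(u)}{S d} \right) \right]$

Table 2: Comparison of regret bounds for online regression

Equipped with these lemmas we now prove Theorem 6.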



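Proof: Let $T' = \left\lceil \sum_{t=1}^{T} \ell_t(u) / S \right\rceil$. Our starting point is the equality $L_T^a(u) = L_T(u) + \sum_{t=1}^{T} \ell_t(u) (a_t - 1)$ stated in (23). From Lemma 7 we get,

$$ \sum_{t=1}^{T} \ell_t(u) (a_t - 1) \le S \sum_{t \in I} (a_t - 1) \le S \sum_{t \in I} (\tilde{a}_t - 1) , \tag{26} $$

where $I$ is the subset of $T'$ indices for which $a_t$ are maximal, and $\tilde{a}_t$ are the resulting coefficients computed with (24) using only the sub-sequence of examples $x_t$ with $t \in I$.

By definition $A_0 = b I$ and thus from Lemma 8 we further bound (26) with,

$$ \sum_{t=1}^{T} \ell_t(u) (a_t - 1) \le S \sum_{s=1}^{d} \sum_{r=1}^{n_s} \frac{1}{b + r - 2} , \tag{27} $$

for some $n_s$ such that $\sum_s n_s = T'$. The right-hand side is maximized when all the counts $n_s$ are about the same (as $d$ may not divide $T'$), and thus we further bound (27) with,

$$ \sum_{t=1}^{T} \ell_t(u) (a_t - 1) \le S \sum_{s=1}^{d} \sum_{r=1}^{\lceil T'/d \rceil} \frac{1}{b + r - 2} \le S d \sum_{r=1}^{\lceil T'/d \rceil} \frac{b}{b-1} \frac{1}{r} \le S d \frac{b}{b-1} \left( 1 + \ln \left\lceil \frac{T'}{d} \right\rceil \right) \le \frac{b}{b-1} S d \left[ 1 + \ln \left( 1 + \frac{L_T(u)}{S d} \right) \right] , $$

which completes the proof. ■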
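
The last chain of inequalities rests on two elementary facts: $\frac{1}{b + r - 2} \le \frac{b}{b-1} \cdot \frac{1}{r}$ for $b > 1$ and integer $r \ge 1$ (equivalent to $b^2 - 2b + r \ge 0$), and $\sum_{r=1}^{n} \frac{1}{r} \le 1 + \ln n$. The short sketch below evaluates the three quantities so the ordering can be inspected for concrete values; the function name is hypothetical.

```python
import math

def harmonic_chain(n, b):
    """Return the three successive quantities bounded in the proof, for n >= 1 and b > 1."""
    first = sum(1.0 / (b + r - 2.0) for r in range(1, n + 1))        # sum_r 1/(b+r-2)
    second = (b / (b - 1.0)) * sum(1.0 / r for r in range(1, n + 1)) # (b/(b-1)) * harmonic sum
    third = (b / (b - 1.0)) * (1.0 + math.log(n))                    # (b/(b-1)) * (1 + ln n)
    return first, second, third
```

The returned values should be non-decreasing from left to right; for $n = 1$ the second and third coincide.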



It is instructive to compare bounds of similar algorithms, summarized in Table 2. Our first bound^2 of Theorem 5 is most similar to the bounds of Forster [12], Vovk [27] and Crammer et al. [9]. Forster and Vovk have a multiplicative factor $Y^2$ of the logarithm, Crammer et al. have the factor $A = \sup_{1 \le t \le T} \ell_t(\mathrm{alg})$, and we have the worst loss of $u$ over all examples (denoted by $S$). Thus, our first bound is better than the bound of Crammer et al. (as often $S < A$), and better than the bounds of Forster and Vovk on problems that are approximately linear, $y_t \approx u \cdot x_t$ for $t = 1, \dots, T$, and where $Y$ is large, while their bound is better if $Y$ is small. Note that the analysis of Forster [12] assumes that the labels $y_t$ are bounded, and formally the algorithm should know this bound, while Crammer et al. assume that the inputs are bounded, as we do.

^2 The bound in the table is obtained by noting that $\log\det$ is a concave function of the eigenvalues of the matrix, upper bounded when all the eigenvalues are equal (with the same trace).

Our second bound of Theorem 6 is similar to the bound of Orabona et al. [21]. Both bounds have potentially sub-logarithmic regret as the cumulative loss $L_T(u)$ may be sublinear in $T$. Yet, their bound has a multiplicative factor of $(U + Y)^2$, while our bound has only the maximal loss $S$, which, as before, can be much smaller. Additionally, their analysis assumes that both the inputs $x_t$ and the labels $y_t$ are bounded, while we only assume that the inputs are bounded. Furthermore, our algorithm does not need to assume and know a compact set which contains $u$ ($\|u\| \le U$), as opposed to their algorithm.


5. Learning in Non-Stationary Environment 



In this section we present a generalization of the last-step min-max predictor for non-stationary problems. We define the predictor to be,

$$\hat{y}_T = \arg\min_{\hat{y}_T} \max_{y_T} \left[ \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - \inf_{u_1,\dots,u_T,\bar{u}} \left( b\|\bar{u}\|^2 + c V_m + L_T^a(u_1,\dots,u_T) \right) \right], \tag{28}$$

for

$$V_m = \sum_{t=1}^{T} \|u_t - \bar{u}\|^2, \tag{29}$$

positive constants $b, c > 0$ and weights $a_t \ge 1$ for $1 \le t \le T$.

As mentioned above, we use an extended notion of function class, using different vectors $u_t$ across time. We circumvent here the problem mentioned in the end of Sec. 2 and restrict the adversary from choosing an arbitrary $T$-tuple $(u_1, \dots, u_T)$ by introducing a reference weight-vector $\bar{u}$. Specifically, we replace the single-weight cumulative loss $L_T^a(u)$ of the stationary formulation with a multi-weight cumulative loss $L_T^a(u_1, \dots, u_T)$ in (28), yet we add the term $cV_m$ to (28), penalizing a $T$-tuple $(u_1, \dots, u_T)$ whose elements $\{u_t\}$ are far from some single point $\bar{u}$. Intuitively, $V_m$ serves as a measure of complexity of the $T$-tuple by measuring the deviation of its elements from some vector.

The new formulation of (28) clearly subsumes the stationary formulation: if $u_1 = \dots = u_T = \bar{u} = u$, then (28) reduces to it. We now show that in fact the two notions of last-step min-max predictors are equivalent.



The following lemma characterizes the solution of the inner infimum of (28) over $\{u_t\}$.

Lemma 9. For any $\bar{u} \in \mathbb{R}^d$, the function

$$J(u_1, \dots, u_T) = b\|\bar{u}\|^2 + c \sum_{t=1}^{T} \|u_t - \bar{u}\|^2 + \sum_{t=1}^{T} a_t \left( y_t - u_t^\top x_t \right)^2$$

is minimal for

$$u_t = \bar{u} + \frac{c^{-1} \left( y_t - \bar{u}^\top x_t \right) x_t}{a_t^{-1} + c^{-1}\|x_t\|^2}$$

for $t = 1 \dots T$. The minimal value of $J(u_1, \dots, u_T)$ is given by

$$J_{\min} = b\|\bar{u}\|^2 + \sum_{t=1}^{T} \frac{1}{a_t^{-1} + c^{-1}\|x_t\|^2} \left( y_t - \bar{u}^\top x_t \right)^2. \tag{30}$$

The proof appears in Appendix D.
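Lemma 9 decouples across rounds: for a fixed reference vector, each $u_t$ minimizes $c\|u_t - \bar{u}\|^2 + a_t (y_t - u_t^\top x_t)^2$ on its own, with minimizer $\bar{u} + c^{-1}(y_t - \bar{u}^\top x_t)x_t / (a_t^{-1} + c^{-1}\|x_t\|^2)$ and minimal value $(y_t - \bar{u}^\top x_t)^2 / (a_t^{-1} + c^{-1}\|x_t\|^2)$. The sketch below checks this single-round claim numerically on random data. It is only an illustration; the helper names and the chosen values of `a`, `c` and `d` are assumptions and do not come from the paper.

```python
import random

def dot(p, q):
    """Inner product of two equal-length lists."""
    return sum(pi * qi for pi, qi in zip(p, q))

def per_round_objective(u, u_bar, x, y, a, c):
    """c * ||u - u_bar||^2 + a * (y - u.x)^2, the t-th summand of J."""
    diff = [ui - bi for ui, bi in zip(u, u_bar)]
    return c * dot(diff, diff) + a * (y - dot(u, x)) ** 2

def closed_form_minimizer(u_bar, x, y, a, c):
    """Single-round minimizer and minimal value in the form stated by Lemma 9."""
    denom = 1.0 / a + dot(x, x) / c
    residual = y - dot(u_bar, x)
    u_star = [bi + (residual / c) * xi / denom for bi, xi in zip(u_bar, x)]
    return u_star, residual ** 2 / denom

random.seed(0)
d, a, c = 3, 1.7, 4.0                      # illustrative values only
u_bar = [random.gauss(0, 1) for _ in range(d)]
x = [random.gauss(0, 1) for _ in range(d)]
y = random.gauss(0, 1)

u_star, j_min = closed_form_minimizer(u_bar, x, y, a, c)
value_at_u_star = per_round_objective(u_star, u_bar, x, y, a, c)

# The objective is a strictly convex quadratic, so random perturbations of
# the minimizer should never decrease it.
perturbed_values = [
    per_round_objective([ui + random.gauss(0, 0.1) for ui in u_star], u_bar, x, y, a, c)
    for _ in range(200)
]
```

Summing the per-round minimal values over $t$ and adding $b\|\bar{u}\|^2$ gives the value in (30).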



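Remark 2. The minimization problem in Lemma 9 can be interpreted as a MAP estimator of $\bar{u}$ based on the sequence $\{(x_t, y_t)\}_{t=1}^{T}$ in the following generative model:

$$\bar{u} \sim \mathcal{N}(0, \sigma_b^2 I), \qquad u_t \sim \mathcal{N}(\bar{u}, \sigma_c^2 I), \qquad y_t \sim \mathcal{N}(x_t^\top u_t, \sigma_t^2),$$

where $\sigma_b^2 = \frac{1}{2b}$, $\sigma_c^2 = \frac{1}{2c}$ and $\sigma_t^2 = \frac{1}{2a_t}$. Indeed,

$$\begin{aligned}
\bar{u}_{MAP} &= \arg\max_{\bar{u}} P\left(\bar{u} \mid \{u_t\}, \{x_t\}, \{y_t\}\right) \\
&= \arg\max_{\bar{u}} P(\bar{u}) \prod_{t=1}^{T} P(u_t \mid \bar{u}) \prod_{t=1}^{T} P(y_t \mid u_t, x_t) \\
&= \arg\min_{\bar{u}} \left[ -\log P(\bar{u}) - \sum_{t=1}^{T} \log P(u_t \mid \bar{u}) - \sum_{t=1}^{T} \log P(y_t \mid u_t, x_t) \right].
\end{aligned} \tag{31}$$

By our Gaussian generative model we have

$$\begin{aligned}
-\log P(\bar{u}) &= \log\left(2\pi\sigma_b^2\right)^{d/2} + \frac{1}{2\sigma_b^2}\|\bar{u}\|^2, \\
-\log P(u_t \mid \bar{u}) &= \log\left(2\pi\sigma_c^2\right)^{d/2} + \frac{1}{2\sigma_c^2}\|u_t - \bar{u}\|^2, \\
-\log P(y_t \mid u_t, x_t) &= \log\left(2\pi\sigma_t^2\right)^{1/2} + \frac{1}{2\sigma_t^2}\left(y_t - x_t^\top u_t\right)^2.
\end{aligned}$$

Substituting in (31) we get

$$\bar{u}_{MAP} = \arg\min_{\bar{u}} \left[ \frac{1}{2\sigma_b^2}\|\bar{u}\|^2 + \frac{1}{2\sigma_c^2} \sum_{t=1}^{T} \|u_t - \bar{u}\|^2 + \sum_{t=1}^{T} \frac{1}{2\sigma_t^2}\left(y_t - x_t^\top u_t\right)^2 \right],$$

and by using $\frac{1}{2\sigma_b^2} = b$, $\frac{1}{2\sigma_c^2} = c$, $\frac{1}{2\sigma_t^2} = a_t$ we get the minimization problem in Lemma 9.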



Substituting (30) in (28) we obtain the following form of the last-step minmax predictor,

$$\hat{y}_T = \arg\min_{\hat{y}_T} \max_{y_T} \left[ \sum_{t=1}^{T} (y_t - \hat{y}_t)^2 - \inf_{\bar{u}} \left( b\|\bar{u}\|^2 + \sum_{t=1}^{T} \frac{1}{a_t^{-1} + c^{-1}\|x_t\|^2} \left( y_t - x_t^\top \bar{u} \right)^2 \right) \right]. \tag{32}$$

Clearly, (32) and the stationary last-step min-max formulation are equivalent when identifying,

$$\tilde{a}_t = \frac{1}{a_t^{-1} + c^{-1}\|x_t\|^2}. \tag{33}$$

Therefore, we can use the results of the previous sections.

Corollary 10. The optimal prediction for the last round $T$ is $\hat{y}_T = b_{T-1}^\top A_{T-1}^{-1} x_T$ if the following condition holds:

$$1 + \tilde{a}_T x_T^\top A_{T-1}^{-1} x_T - \tilde{a}_T \le 0,$$

where $\tilde{a}_t$ is defined by (33), and where the definitions of $A_t$ and $b_t$ are replaced with

$$A_t = bI + \sum_{s=1}^{t} \tilde{a}_s x_s x_s^\top = bI + \sum_{s=1}^{t} \frac{1}{a_s^{-1} + c^{-1}\|x_s\|^2}\, x_s x_s^\top$$

and

$$b_t = \sum_{s=1}^{t} \tilde{a}_s y_s x_s = \sum_{s=1}^{t} \frac{1}{a_s^{-1} + c^{-1}\|x_s\|^2}\, y_s x_s.$$

Although most of the analysis above holds for $1 + \tilde{a}_t x_t^\top A_{t-1}^{-1} x_t - \tilde{a}_t \le 0$, in the end of the day, Theorem 5 assumed that this inequality holds as equality. Substituting $\tilde{a}_t = \frac{1}{1 - x_t^\top A_{t-1}^{-1} x_t}$ in (33) and solving for $a_t$ we obtain,

$$a_t = \frac{1}{1 - x_t^\top A_{t-1}^{-1} x_t - c^{-1}\|x_t\|^2}. \tag{34}$$

The last-step minmax predictor (28) is convex if $a_t > 0$, which holds if $1/b + 1/c < 1$, because $A_0^{-1} = (1/b) I$ and we assume that $\|x_t\| \le 1$.
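To make the relation between the weight $a_t$ of (34) and the effective weight $\tilde{a}_t$ of (33) concrete, the following small sketch evaluates both for a few values of $\|x_t\|^2 \le 1$ and of $q = x_t^\top A_{t-1}^{-1} x_t \in [0, \|x_t\|^2 / b]$. It is an illustration under assumed values of `b` and `c` with $1/b + 1/c < 1$; the function names are hypothetical.

```python
def a_tilde_from_a(a, x_norm_sq, c):
    """Effective weight of (33): 1 / (1/a + ||x||^2 / c)."""
    return 1.0 / (1.0 / a + x_norm_sq / c)

def a_from_q(q, x_norm_sq, c):
    """Weight of (34), where q stands for x_t^T A_{t-1}^{-1} x_t."""
    return 1.0 / (1.0 - q - x_norm_sq / c)

b, c = 2.0, 3.0                      # illustrative values with 1/b + 1/c < 1
results = []                         # tuples (a_t, a_tilde_t, 1 / (1 - q))
for x_norm_sq in (0.0, 0.25, 1.0):
    for frac in (0.0, 0.5, 1.0):
        q = frac * x_norm_sq / b     # q cannot exceed ||x||^2 / b because A_{t-1} >= b I
        a = a_from_q(q, x_norm_sq, c)
        results.append((a, a_tilde_from_a(a, x_norm_sq, c), 1.0 / (1.0 - q)))
```

In every case the weight of (34) is positive, and plugging it into (33) recovers $1/(1 - q)$, the choice for which the condition of Corollary 10 holds with equality.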

Let us state the analogous statements of Theorem 4 and Theorem 5. Substituting Lemma 9 in Theorem 4 we bound the cumulative loss of the algorithm with the weighted loss of any $T$-tuple $(u_1, \dots, u_T)$.

Corollary 11. Assume $\|x_t\| \le 1$, $1 + \tilde{a}_t x_t^\top A_{t-1}^{-1} x_t - \tilde{a}_t \le 0$ for $t = 1 \dots T$, and $1/b + 1/c < 1$. Then, the loss of the last-step minmax predictor, $\hat{y}_t = b_{t-1}^\top A_{t-1}^{-1} x_t$ for $t = 1 \dots T$, is upper bounded by,

$$L_T(\mathrm{WEMM}) \le \inf_{\bar{u} \in \mathbb{R}^d} \left( b\|\bar{u}\|^2 + L_T^{\tilde{a}}(\bar{u}) \right) = \inf_{u_1,\dots,u_T,\bar{u}} \left( b\|\bar{u}\|^2 + c \sum_{t=1}^{T} \|u_t - \bar{u}\|^2 + L_T^{a}(u_1, \dots, u_T) \right).$$

Furthermore, if $1 + \tilde{a}_t x_t^\top A_{t-1}^{-1} x_t - \tilde{a}_t = 0$, then the last inequality is in fact an equality.
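The equality case can be observed numerically. The sketch below runs a predictor of the form $\hat{y}_t = b_{t-1}^\top A_{t-1}^{-1} x_t$ with effective weights $1/(1 - x_t^\top A_{t-1}^{-1} x_t)$, maintaining $A_t^{-1}$ by a Sherman-Morrison rank-one update, and compares its cumulative squared loss with $\inf_{\bar{u}} \left( b\|\bar{u}\|^2 + \sum_t \tilde{a}_t (y_t - \bar{u}^\top x_t)^2 \right) = \sum_t \tilde{a}_t y_t^2 - b_T^\top A_T^{-1} b_T$. This is a minimal illustration on synthetic data, not the authors' implementation; the function name `wemm_run`, the data, and the constants are assumptions.

```python
import random

def wemm_run(xs, ys, b):
    """Run the weighted last-step min-max predictor with weights 1/(1 - x^T A^{-1} x).

    Returns (cumulative squared loss, inf_u [b||u||^2 + weighted loss of u]).
    """
    d = len(xs[0])
    a_inv = [[(1.0 / b if i == j else 0.0) for j in range(d)] for i in range(d)]  # A_0^{-1} = I / b
    b_vec = [0.0] * d                # b_0 = 0
    loss = 0.0
    weighted_label_energy = 0.0      # sum_t a_t * y_t^2
    for x, y in zip(xs, ys):
        ax = [sum(a_inv[i][j] * x[j] for j in range(d)) for i in range(d)]  # A_{t-1}^{-1} x_t
        q = sum(xi * axi for xi, axi in zip(x, ax))                         # x_t^T A_{t-1}^{-1} x_t
        y_hat = sum(bi * axi for bi, axi in zip(b_vec, ax))                 # prediction
        loss += (y - y_hat) ** 2
        a = 1.0 / (1.0 - q)
        # Sherman-Morrison update for A_t = A_{t-1} + a_t x_t x_t^T
        scale = 1.0 / (1.0 / a + q)
        a_inv = [[a_inv[i][j] - scale * ax[i] * ax[j] for j in range(d)] for i in range(d)]
        b_vec = [bi + a * y * xi for bi, xi in zip(b_vec, x)]
        weighted_label_energy += a * y * y
    ab = [sum(a_inv[i][j] * b_vec[j] for j in range(d)) for i in range(d)]  # A_T^{-1} b_T
    comparator = weighted_label_energy - sum(bi * abi for bi, abi in zip(b_vec, ab))
    return loss, comparator

random.seed(1)
d, T, b = 3, 50, 2.0                 # illustrative sizes; b > 1
xs = []
for _ in range(T):
    v = [random.gauss(0, 1) for _ in range(d)]
    norm = sum(vi * vi for vi in v) ** 0.5
    xs.append([vi / norm for vi in v])        # unit-norm inputs, so ||x_t|| <= 1
ys = [random.gauss(0, 5) for _ in range(T)]   # labels are not assumed to be bounded

loss, comparator = wemm_run(xs, ys, b)
```

Up to floating-point error the two returned numbers coincide, which is the equality statement of the corollary for this choice of weights.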

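Next we relate the weighted cumulative loss $L_T^a(u_1, \dots, u_T)$ to the loss itself $L_T(u_1, \dots, u_T)$.

Corollary 12. Assume $\|x_t\| \le 1$ for $t = 1 \dots T$, $b > 1$ and $1/b + 1/c < 1$. Assume additionally that

$$a_t = \frac{1}{1 - x_t^\top A_{t-1}^{-1} x_t - c^{-1}\|x_t\|^2}$$

as given in (34). Then

$$L_T^a(u_1, \dots, u_T) \le L_T(u_1, \dots, u_T) + \frac{b}{b-1}\, S \ln\left| \frac{1}{b} A_T \right| + TS\, \frac{1}{c\left(1 - b^{-1}\right)^2 - \left(1 - b^{-1}\right)}.$$

Proof: We start as in the proof of Theorem 5 and decompose the weighted loss,

$$\begin{aligned}
L_T^a(u_1, \dots, u_T) &= L_T(u_1, \dots, u_T) + \sum_t (a_t - 1)\, \ell_t(u_t) \\
&\le L_T(u_1, \dots, u_T) + S \sum_t (\tilde{a}_t - 1) + S \sum_t (a_t - \tilde{a}_t).
\end{aligned} \tag{35}$$

We bound the sum of the third term,

$$\begin{aligned}
a_t - \tilde{a}_t &= \frac{1}{1 - x_t^\top A_{t-1}^{-1} x_t - c^{-1}\|x_t\|^2} - \frac{1}{1 - x_t^\top A_{t-1}^{-1} x_t} \\
&= \frac{c^{-1}\|x_t\|^2}{\left(1 - x_t^\top A_{t-1}^{-1} x_t - c^{-1}\|x_t\|^2\right)\left(1 - x_t^\top A_{t-1}^{-1} x_t\right)} \\
&\le \frac{c^{-1}}{\left(1 - b^{-1} - c^{-1}\right)\left(1 - b^{-1}\right)} = \frac{1}{c\left(1 - b^{-1}\right)^2 - \left(1 - b^{-1}\right)}.
\end{aligned} \tag{36}$$

Additionally, as in Theorem 5 the second term is bounded with $\frac{b}{b-1}\, S \ln\left| \frac{1}{b} A_T \right|$. Substituting this bound and (36) in (35) completes the proof.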



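Combining the last two corollaries yields the main result of this section.

Corollary 13. Under the conditions of Corollary 12 the cumulative loss of the last-step minmax predictor is upper bounded by,

$$L_T(\mathrm{WEMM}) \le \inf_{u_1,\dots,u_T,\bar{u}} \left( b\|\bar{u}\|^2 + cV_m + L_T(u_1, \dots, u_T) + \frac{Sb}{b-1} \ln\left| \frac{1}{b} A_T \right| + \frac{TS}{c\left(1 - b^{-1}\right)^2 - \left(1 - b^{-1}\right)} \right),$$

where $V_m$ is the deviation of $\{u_t\}$ from some fixed weight-vector as defined in (29). Additionally, setting $c_V = \frac{b}{b-1}\left(1 + \sqrt{TS / V_m}\right)$, which minimizes the above bound over $c$, yields for any $u_1, \dots, u_T, \bar{u}$,

$$L_T(\mathrm{WEMM}) \le b\|\bar{u}\|^2 + L_T(u_1, \dots, u_T) + \frac{Sb}{b-1} \ln\left| \frac{1}{b} A_T \right| + \frac{b}{b-1}\left( V_m + 2\sqrt{TS\, V_m} \right).$$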
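The choice of $c$ only affects the two terms $cV_m + TS / \left(c(1 - b^{-1})^2 - (1 - b^{-1})\right)$. Writing $\alpha = 1 - b^{-1}$ and setting the derivative with respect to $c$ to zero gives the minimizer $\frac{1}{\alpha}\left(1 + \sqrt{TS/V_m}\right)$ and the minimal value $\frac{1}{\alpha}\left(V_m + 2\sqrt{TS\,V_m}\right)$. The sketch below checks this numerically; the function names and the numbers used for $b$, $T$, $S$ and $V_m$ are illustrative assumptions only.

```python
import math

def tradeoff(c, b, T, S, V):
    """The two c-dependent terms of the bound: c*V + T*S / (c*alpha^2 - alpha)."""
    alpha = 1.0 - 1.0 / b
    return c * V + T * S / (c * alpha ** 2 - alpha)

def c_opt(b, T, S, V):
    """Closed-form minimizer (b/(b-1)) * (1 + sqrt(T*S/V)); requires V > 0."""
    return (b / (b - 1.0)) * (1.0 + math.sqrt(T * S / V))

b, T, S, V = 2.0, 1000, 0.5, 4.0          # illustrative values only
c_star = c_opt(b, T, S, V)
best = tradeoff(c_star, b, T, S, V)
closed_form = (b / (b - 1.0)) * (V + 2.0 * math.sqrt(T * S * V))

# Nearby feasible values of c (those with c > b/(b-1)) should not do better.
neighbours = [tradeoff(c_star * f, b, T, S, V) for f in (0.5, 0.9, 1.1, 2.0)]
```

The optimized deviation term grows like $\sqrt{T S V_m}$, so it is sublinear in $T$ whenever $V_m$ is.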



A few comments. First, it is straightforward to verify that $c_V = \frac{b}{b-1}\left(1 + \sqrt{TS / V_m}\right)$ satisfies the constraint $1/b + 1/c_V < 1$. Second, this bound strictly generalizes the bound for the stationary case, since Corollary 12 reduces to Theorem 5 when all the weight-vectors equal each other, $u_1 = \dots = u_T = \bar{u}$ (i.e. $V_m = 0$). Third, the constant $c$ (or $c_V$) is not used by the algorithm, but only in the analysis. So there is no need to know the actual deviation $V_m$ to tune the algorithm. In other words, the bound applies essentially to the same last-step minmax predictor defined in Theorem 2. Finally, we have a bound for the non-stationary case based on Theorem 6 instead of Theorem 5 by replacing the term

$$\frac{Sb}{b-1} \ln\left| \frac{1}{b} A_T \right|$$

with

$$\frac{Sbd}{b-1} \left[ 1 + \ln\left( 1 + \frac{L_T(u_1, \dots, u_T)}{Sd} \right) \right].$$

6. Related work



The problem of predicting reals in an online manner has been studied for more than five decades. Clearly we cannot cover all previous work here, and the reader is referred to the encyclopedic book of Cesa-Bianchi and Lugosi [7] for a full survey.

Widrow and Hoff [28] studied a gradient descent algorithm for the squared loss. Many variants of the algorithm have been studied since then. A notable example is the normalized least mean squares algorithm (NLMS) [2, 3] that adapts to the input's scale. More gradient descent based algorithms and bounds for regression with the squared loss were proposed by Cesa-Bianchi et al. [5] about two decades ago. These algorithms were generalized and extended by Kivinen and Warmuth [19] using additional regularization functions.

An online version of the ridge regression algorithm in the worst-case setting was proposed and analyzed by Foster [15]. A related algorithm called the Aggregating Algorithm (AA) was studied by Vovk [26]. See also the work of Azoury and Warmuth [1].

The recursive least squares (RLS) [15] is a similar algorithm proposed for adaptive filtering. A variant of the RLS algorithm (AROW for regression [35]) was analysed by Crammer et al. [5]. All algorithms make use of second order information, as they maintain a weight-vector and a covariance-like positive semi-definite (PSD) matrix used to re-weight the input. The eigenvalues of this covariance-like matrix grow with time $t$, a property which is used to prove logarithmic regret bounds. Orabona et al. [21] showed that a bound beyond logarithmic regret can be achieved when the total loss of the best linear model is sublinear in $T$. We derive a similar bound, with a multiplicative factor that depends on the worst loss of $u$, rather than a bound $Y$ on the labels. Hazan and Kale [16] developed regret bounds that depend logarithmically on the variance of the side information used to define the loss sequence. In the regression case, this corresponds to a bound that depends on the variance of the instance vectors $x_t$, rather than on the loss of the competitor, as do the bound of Orabona et al. [21] and our bound.

The derivation of our algorithm shares similarities with the work of Forster [12]. Both algorithms are motivated from the last-step min-max predictor. Yet, the formulation of Forster [12] yields a convex optimization for which the max operation over $y_t$ is not bounded, and thus he used an artificial clipping operation to avoid unbounded solutions. With a proper tuning of $a_t$ and a weighted loss, we are able to obtain a problem that is convex in $\hat{y}_t$ and concave in $y_t$, and thus well defined.

Most recent work is focused on the stationary setting. We also discuss a specific weak notion of non-stationary setting, for which a few weight-vectors can be used for comparison and their total deviation is computed with respect to some single weight-vector. Recently, Vaits and Crammer [25] proposed an algorithm designed for non-stationary environments. Herbster and Warmuth [17] discussed general gradient descent algorithms with projection of the weight-vector using the Bregman divergence, and Zinkevich [29] developed an algorithm for online convex programming. Busuttil and Kalnishkan [3] developed a variant of the aggregating algorithm in the non-stationary environment. They all use a stronger notion of diversity between vectors, as their distance is measured with consecutive vectors (that is, drift that may end far from the starting point). Thus, the bounds in these papers cannot be compared in general to our bound in Corollary 13. The $H_\infty$ filters (see e.g. papers by Simon [22, 23]) are a family of (robust) linear filters developed based on a min-max approach, like WEMM, and analyzed in the worst case setting. These filters are reminiscent of the celebrated Kalman filter [18], which was motivated and analyzed in a stochastic setting
with Gaussian noise. Finally, few second-order algorithms were recently proposed in other contexts [51 [5J 

MM- 



7. Summary and Conclusions 

We proposed a modification of the last-step min-max algorithm [12] using weights over examples, and 
showed how to choose these weights for the problem to be well defined - convex - which enabled us to develop 
the last step min-max predictor, without requiring the labels to be bounded. Our algorithmic formulations 
depend on inner- and outer-products and thus can be employed with kernel functions. Our analysis bounds 
the regret with quantities that depend only on the loss of the competitor, with no need for any knowledge 
of the problem. Our prediction algorithm was motivated from the last-step minmax predictor problem for the stationary setting, but we showed that the same algorithm can be used to derive a bound for a class of non-stationary problems as well.

An interesting direction would be to extend the algorithm for general loss functions rather than the squared loss, or to classification tasks.

Appendix A. Proof of Lemma 3

Proof: Using the Woodbury identity we get

$$A_t^{-1} = A_{t-1}^{-1} - \frac{A_{t-1}^{-1} x_t x_t^\top A_{t-1}^{-1}}{a_t^{-1} + x_t^\top A_{t-1}^{-1} x_t},$$

therefore the left side of the identity claimed in the lemma is

$$\begin{aligned}
a_t^2 x_t^\top A_t^{-1} x_t + 1 - a_t
&= a_t^2 x_t^\top \left( A_{t-1}^{-1} - \frac{A_{t-1}^{-1} x_t x_t^\top A_{t-1}^{-1}}{a_t^{-1} + x_t^\top A_{t-1}^{-1} x_t} \right) x_t + 1 - a_t \\
&= a_t^2 x_t^\top A_{t-1}^{-1} x_t - \frac{a_t^2\, x_t^\top A_{t-1}^{-1} x_t\, x_t^\top A_{t-1}^{-1} x_t}{a_t^{-1} + x_t^\top A_{t-1}^{-1} x_t} + 1 - a_t \\
&= \frac{1 + a_t x_t^\top A_{t-1}^{-1} x_t - a_t}{1 + a_t x_t^\top A_{t-1}^{-1} x_t}.
\end{aligned}$$
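The identity derived above, $a_t^2 x_t^\top A_t^{-1} x_t + 1 - a_t = \frac{1 + a_t x_t^\top A_{t-1}^{-1} x_t - a_t}{1 + a_t x_t^\top A_{t-1}^{-1} x_t}$ with $A_t = A_{t-1} + a_t x_t x_t^\top$, can be checked directly on a small example. The following sketch does so in two dimensions with an explicit matrix inverse; the matrix, vector and weight are arbitrary illustrative values, and the helper names are assumptions.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (m00, m01), (m10, m11) = m
    det = m00 * m11 - m01 * m10
    return [[m11 / det, -m01 / det], [-m10 / det, m00 / det]]

def quad(m, x):
    """Quadratic form x^T m x for a 2x2 matrix m."""
    return sum(x[i] * m[i][j] * x[j] for i in range(2) for j in range(2))

a_prev = [[2.0, 0.3], [0.3, 1.5]]   # an arbitrary positive definite A_{t-1}
x = [0.6, -0.8]
a_t = 1.4                           # any positive weight; the identity holds for all of them

# Rank-one update A_t = A_{t-1} + a_t x x^T
a_curr = [[a_prev[i][j] + a_t * x[i] * x[j] for j in range(2)] for i in range(2)]
q_prev = quad(inv2(a_prev), x)      # x^T A_{t-1}^{-1} x
q_curr = quad(inv2(a_curr), x)      # x^T A_t^{-1} x

lhs = a_t ** 2 * q_curr + 1.0 - a_t
rhs = (1.0 + a_t * q_prev - a_t) / (1.0 + a_t * q_prev)
```

The same computation also confirms the intermediate relation $x_t^\top A_t^{-1} x_t = x_t^\top A_{t-1}^{-1} x_t / (1 + a_t x_t^\top A_{t-1}^{-1} x_t)$ that follows from the Woodbury identity.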



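Appendix B. Proof of Theorem 4

Proof: Using the Woodbury matrix identity we get

$$A_t^{-1} = A_{t-1}^{-1} - \frac{A_{t-1}^{-1} x_t x_t^\top A_{t-1}^{-1}}{a_t^{-1} + x_t^\top A_{t-1}^{-1} x_t}, \tag{B.1}$$

therefore

$$A_t^{-1} x_t = A_{t-1}^{-1} x_t - \frac{A_{t-1}^{-1} x_t x_t^\top A_{t-1}^{-1} x_t}{a_t^{-1} + x_t^\top A_{t-1}^{-1} x_t} = \frac{A_{t-1}^{-1} x_t}{1 + a_t x_t^\top A_{t-1}^{-1} x_t}. \tag{B.2}$$

For $t = 1 \dots T$ we have

$$\begin{aligned}
&\ell_t(\mathrm{WEMM}) + \inf_{u \in \mathbb{R}^d} \left( b\|u\|^2 + L_{t-1}^a(u) \right) - \inf_{u \in \mathbb{R}^d} \left( b\|u\|^2 + L_t^a(u) \right) \\
&= (y_t - \hat{y}_t)^2 + \inf_{u \in \mathbb{R}^d} \left( b\|u\|^2 + \sum_{s=1}^{t-1} a_s \left( y_s - u^\top x_s \right)^2 \right) - \inf_{u \in \mathbb{R}^d} \left( b\|u\|^2 + \sum_{s=1}^{t} a_s \left( y_s - u^\top x_s \right)^2 \right) \\
&= (y_t - \hat{y}_t)^2 + \sum_{s=1}^{t-1} a_s y_s^2 - b_{t-1}^\top A_{t-1}^{-1} b_{t-1} - \sum_{s=1}^{t} a_s y_s^2 + b_t^\top A_t^{-1} b_t \\
&= (y_t - \hat{y}_t)^2 - a_t y_t^2 - b_{t-1}^\top A_{t-1}^{-1} b_{t-1} + b_t^\top A_t^{-1} b_t \\
&= (y_t - \hat{y}_t)^2 - a_t y_t^2 - b_{t-1}^\top A_{t-1}^{-1} b_{t-1} + b_{t-1}^\top A_t^{-1} b_{t-1} + 2 a_t y_t b_{t-1}^\top A_t^{-1} x_t + a_t^2 y_t^2 x_t^\top A_t^{-1} x_t \\
&= (y_t - \hat{y}_t)^2 - a_t y_t^2 - b_{t-1}^\top \left( A_{t-1}^{-1} - A_t^{-1} \right) b_{t-1} + 2 a_t y_t b_{t-1}^\top A_t^{-1} x_t + a_t^2 y_t^2 x_t^\top A_t^{-1} x_t \\
&= (y_t - \hat{y}_t)^2 - a_t y_t^2 - b_{t-1}^\top A_t^{-1} a_t x_t x_t^\top A_{t-1}^{-1} b_{t-1} + 2 a_t y_t b_{t-1}^\top A_t^{-1} x_t + a_t^2 y_t^2 x_t^\top A_t^{-1} x_t \\
&= (y_t - \hat{y}_t)^2 - a_t y_t^2 + a_t \left( -\hat{y}_t b_{t-1}^\top + 2 y_t b_{t-1}^\top + a_t y_t^2 x_t^\top \right) A_t^{-1} x_t \\
&\overset{\text{(B.2)}}{=} (y_t - \hat{y}_t)^2 - a_t y_t^2 + a_t \left( -\hat{y}_t b_{t-1}^\top + 2 y_t b_{t-1}^\top + a_t y_t^2 x_t^\top \right) \frac{A_{t-1}^{-1} x_t}{1 + a_t x_t^\top A_{t-1}^{-1} x_t} \\
&= (y_t - \hat{y}_t)^2 + a_t\, \frac{-y_t^2 - y_t^2 a_t x_t^\top A_{t-1}^{-1} x_t - \hat{y}_t^2 + 2 y_t \hat{y}_t + a_t y_t^2 x_t^\top A_{t-1}^{-1} x_t}{1 + a_t x_t^\top A_{t-1}^{-1} x_t} \\
&= (y_t - \hat{y}_t)^2 - \frac{a_t (y_t - \hat{y}_t)^2}{1 + a_t x_t^\top A_{t-1}^{-1} x_t} \\
&= \frac{1 + a_t x_t^\top A_{t-1}^{-1} x_t - a_t}{1 + a_t x_t^\top A_{t-1}^{-1} x_t}\, (y_t - \hat{y}_t)^2 \le 0.
\end{aligned}$$

Here we used $x_t^\top A_{t-1}^{-1} b_{t-1} = \hat{y}_t$ and $A_{t-1}^{-1} - A_t^{-1} = A_t^{-1} a_t x_t x_t^\top A_{t-1}^{-1}$. Summing over $t \in \{1, \dots, T\}$, the infimum terms telescope and $\inf_{u} b\|u\|^2 = 0$, which yields $L_T(\mathrm{WEMM}) - \inf_{u \in \mathbb{R}^d} \left( b\|u\|^2 + L_T^a(u) \right) \le 0$. ■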



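Appendix C. Proof of Theorem 5

Proof: From (B.1) we see that $A_t^{-1} \preceq A_{t-1}^{-1}$ and because $A_0 = bI$ we get

$$x_t^\top A_t^{-1} x_t \le x_t^\top A_{t-1}^{-1} x_t \le x_t^\top A_{t-2}^{-1} x_t \le \dots \le x_t^\top A_0^{-1} x_t = \frac{1}{b}\|x_t\|^2 \le \frac{1}{b},$$

therefore $1 \le a_t \le \frac{1}{1 - \frac{1}{b}} = \frac{b}{b-1}$. From (B.2) we have

$$x_t^\top A_t^{-1} x_t = \frac{x_t^\top A_{t-1}^{-1} x_t}{1 + a_t x_t^\top A_{t-1}^{-1} x_t} = \frac{1 - \frac{1}{a_t}}{1 + a_t \left( 1 - \frac{1}{a_t} \right)} = \frac{a_t - 1}{a_t^2},$$

so we can bound the term $a_t - 1$ as follows,

$$a_t - 1 = a_t^2 x_t^\top A_t^{-1} x_t \le \frac{b}{b-1}\, a_t x_t^\top A_t^{-1} x_t. \tag{C.1}$$

With an argument similar to [12] we have $a_t x_t^\top A_t^{-1} x_t \le \ln \frac{|A_t|}{\left| A_t - a_t x_t x_t^\top \right|} = \ln \frac{|A_t|}{|A_{t-1}|}$. Summing the last inequality over $t$ and using the initial value $\ln \left| \frac{1}{b} A_0 \right| = 0$ we get

$$\sum_{t=1}^{T} a_t x_t^\top A_t^{-1} x_t \le \ln \left| \frac{1}{b} A_T \right|. \tag{C.2}$$

Substituting the last equation in (C.1) we get the logarithmic bound $\sum_{t=1}^{T} (a_t - 1) \le \frac{b}{b-1} \ln \left| \frac{1}{b} A_T \right|$, as required. ■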
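The logarithmic bound can be observed on synthetic inputs. The sketch below accumulates $\sum_t (a_t - 1)$ for weights $a_t = 1/(1 - x_t^\top A_{t-1}^{-1} x_t)$ and tracks $\ln\left|\frac{1}{b} A_T\right|$ through the matrix determinant lemma, $|A_t| = |A_{t-1}|\,(1 + a_t x_t^\top A_{t-1}^{-1} x_t)$, so no explicit determinant is needed. It is a hedged illustration; the function name, the input distribution, and the constants are assumptions.

```python
import math
import random

def weights_and_logdet(xs, b):
    """Return (sum_t (a_t - 1), ln|A_T / b|, list of a_t) for a_t = 1/(1 - x_t^T A_{t-1}^{-1} x_t)."""
    d = len(xs[0])
    a_inv = [[(1.0 / b if i == j else 0.0) for j in range(d)] for i in range(d)]  # A_0^{-1} = I / b
    excess, log_det, weights = 0.0, 0.0, []
    for x in xs:
        ax = [sum(a_inv[i][j] * x[j] for j in range(d)) for i in range(d)]  # A_{t-1}^{-1} x_t
        q = sum(xi * axi for xi, axi in zip(x, ax))                         # x_t^T A_{t-1}^{-1} x_t
        a = 1.0 / (1.0 - q)
        excess += a - 1.0
        # Matrix determinant lemma: |A_t| = |A_{t-1}| * (1 + a_t * q)
        log_det += math.log(1.0 + a * q)
        # Sherman-Morrison update for A_t = A_{t-1} + a_t x_t x_t^T
        scale = 1.0 / (1.0 / a + q)
        a_inv = [[a_inv[i][j] - scale * ax[i] * ax[j] for j in range(d)] for i in range(d)]
        weights.append(a)
    return excess, log_det, weights

random.seed(2)
d, T, b = 4, 200, 1.5                # illustrative sizes; b > 1
# Each coordinate lies in [-0.5, 0.5], so ||x_t||^2 <= 4 * 0.25 = 1.
xs = [[random.uniform(-0.5, 0.5) for _ in range(d)] for _ in range(T)]
excess, log_det, weights = weights_and_logdet(xs, b)
```

Every weight stays in the interval $[1, \frac{b}{b-1}]$, and the accumulated excess $\sum_t (a_t - 1)$ stays below $\frac{b}{b-1} \ln\left|\frac{1}{b} A_T\right|$.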



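Appendix D. Proof of Lemma 9

Proof: We set the derivative of $J$ with respect to $u_t$ to zero,

$$\frac{\partial J}{\partial u_t} = 2c\,(u_t - \bar{u}) - 2 a_t \left( y_t - u_t^\top x_t \right) x_t = 0,$$

and solve for $u_t$:

$$\begin{aligned}
u_t &= \left( cI + a_t x_t x_t^\top \right)^{-1} \left( c\bar{u} + a_t y_t x_t \right) \\
&= \left( c^{-1} I - \frac{c^{-2}}{a_t^{-1} + c^{-1}\|x_t\|^2}\, x_t x_t^\top \right) \left( c\bar{u} + a_t y_t x_t \right) \\
&= \bar{u} + \frac{c^{-1} \left( y_t - \bar{u}^\top x_t \right) x_t}{a_t^{-1} + c^{-1}\|x_t\|^2}.
\end{aligned} \tag{D.1}$$

For the optimal $u_t$ of (D.1), we compute the following two terms, which are used next,

$$\left( y_t - u_t^\top x_t \right)^2 = \left( y_t - \left( \bar{u} + \frac{c^{-1} \left( y_t - \bar{u}^\top x_t \right) x_t}{a_t^{-1} + c^{-1}\|x_t\|^2} \right)^{\!\top} x_t \right)^2 = \frac{a_t^{-2} \left( y_t - \bar{u}^\top x_t \right)^2}{\left( a_t^{-1} + c^{-1}\|x_t\|^2 \right)^2}, \tag{D.2}$$

$$\|u_t - \bar{u}\|^2 = \left\| \frac{c^{-1} \left( y_t - \bar{u}^\top x_t \right) x_t}{a_t^{-1} + c^{-1}\|x_t\|^2} \right\|^2 = \frac{c^{-2} \left( y_t - \bar{u}^\top x_t \right)^2 \|x_t\|^2}{\left( a_t^{-1} + c^{-1}\|x_t\|^2 \right)^2}. \tag{D.3}$$

From (D.2) and (D.3) we get

$$c\,\|u_t - \bar{u}\|^2 + a_t \left( y_t - u_t^\top x_t \right)^2 = \frac{1}{a_t^{-1} + c^{-1}\|x_t\|^2} \left( y_t - \bar{u}^\top x_t \right)^2.$$

Therefore the minimal value of $J(u_1, \dots, u_T)$ is given by,

$$J_{\min} = b\|\bar{u}\|^2 + \sum_{t=1}^{T} \frac{1}{a_t^{-1} + c^{-1}\|x_t\|^2} \left( y_t - \bar{u}^\top x_t \right)^2,$$

which completes the proof.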